Abstract

We present simple and computationally efficient nonparametric estimators of
Rényi entropy and mutual information based on an i.i.d. sample drawn from an
unknown, absolutely continuous distribution over $\mathbb{R}^d$. The estimators are
calculated as the sum of $p$-th powers of the Euclidean lengths of the edges of the
'generalized nearest-neighbor' graph of the sample and the empirical copula of
the sample respectively. For the first time, we prove the almost sure consistency
of these estimators and upper bounds on their rates of convergence, the latter
under the assumption that the density underlying the sample is Lipschitz
continuous. Experiments demonstrate their usefulness in independent subspace
analysis.



1 Introduction 



We consider the nonparametric problem of estimating Rényi $\alpha$-entropy and mutual information (MI)
based on a finite sample drawn from an unknown, absolutely continuous distribution over $\mathbb{R}^d$. There
are many applications that make use of such estimators, of which we list a few to give the reader
a taste: Entropy estimators can be used for goodness-of-fit testing (Vasicek, 1976; Goria et al.,
2005), parameter estimation in semi-parametric models (Wolsztynski et al., 2005), studying fractal
random walks (Alemany and Zanette, 1994), and texture classification (Hero et al., 2002b,a). Mutual
information estimators have been used in feature selection (Peng and Ding, 2005), clustering
(Aghagolzadeh et al., 2007), causality detection (Hlaváčková-Schindler et al., 2007), optimal
experimental design (Lewi et al., 2007; Póczos and Lőrincz, 2009), fMRI data processing (Chai et al.,
2009), prediction of protein structures (Adami, 2004), or boosting and facial expression recognition
(Shan et al., 2005). Both entropy estimators and mutual information estimators have been used
for independent component and subspace analysis (Learned-Miller and Fisher, 2003; Póczos and
Lőrincz, 2005; Hulle, 2008; Szabó et al., 2007), and image registration (Kybic, 2006; Hero et al.,
2002b,a). For further applications, see Leonenko et al. (2008); Wang et al. (2009a).

In a naive approach to Rényi entropy and mutual information estimation, one could use the so-called
"plug-in" estimates. These are based on the obvious idea that since entropy and mutual information
are determined solely by the density $f$ (and its marginals), it suffices to first estimate the density
using one's favorite density estimate which is then "plugged-in" into the formulas defining entropy
and mutual information. The density is, however, a nuisance parameter which we do not want to
estimate. Density estimators have tunable parameters and we may need cross validation to achieve
good performance.

The entropy estimation algorithm considered here is direct: it does not build on density estimators.
It is based on $k$-nearest-neighbor (NN) graphs with a fixed $k$. A variant of these estimators, where
each sample point is connected to its $k$-th nearest neighbor only, was recently studied by Goria
et al. (2005) for Shannon entropy estimation (i.e. the special case $\alpha = 1$) and Leonenko et al.
(2008) for Rényi $\alpha$-entropy estimation. They proved the weak consistency of their estimators under
certain conditions. However, their proofs contain some errors, and it is not obvious how to fix them.
Namely, Leonenko et al. (2008) apply the generalized Helly-Bray theorem, while Goria et al. (2005)
apply the inverse Fatou lemma under conditions when these theorems do not hold. This latter error
originates from the article of Kozachenko and Leonenko (1987), and this mistake can also be found
in Wang et al. (2009b).



The first main contribution of this paper is to give a correct proof of consistency of these estimators.
Employing very different proof techniques than the papers mentioned above, we show that these
estimators are, in fact, strongly consistent provided that the unknown density $f$ has bounded support
and $\alpha \in (0, 1)$. At the same time, we allow for more general nearest-neighbor graphs, wherein
as opposed to connecting each point only to its $k$-th nearest neighbor, we allow each point to be
connected to an arbitrary subset of its $k$ nearest neighbors. Besides adding generality, our numerical
experiments seem to suggest that connecting each sample point to all its $k$ nearest neighbors
improves the rate of convergence of the estimator.

The second major contribution of our paper is that we prove a finite-sample high-probability bound
on the error (i.e. the rate of convergence) of our estimator provided that $f$ is Lipschitz. To
the best of our knowledge, this is the very first result that gives a rate for the estimation of Rényi
entropy. The closest to our result in this respect is the work by Tsybakov and van der Meulen
(1996) who proved the root-$n$ consistency of an estimator of the Shannon entropy and only in one
dimension.

The third contribution is a strongly consistent estimator of Rényi mutual information that is based on
NN graphs and the empirical copula transformation (Dedecker et al., 2007). This result is proved for
$d \ge 3$¹ and $\alpha \in (1/2, 1)$. This builds upon and extends the previous work of Póczos et al. (2010)
where instead of NN graphs, the minimum spanning tree (MST) and the shortest tour through the
sample (i.e. the traveling salesman problem, TSP) were used, but it was only conjectured that NN
graphs can be applied as well.

There are several advantages of using the $k$-NN graph over MST and TSP (besides the obvious conceptual
simplicity of $k$-NN): On a serial computer the $k$-NN graph can be computed somewhat faster
than MST and much faster than the TSP tour. Furthermore, in contrast to MST and TSP, computation
of $k$-NN can be easily parallelized. Secondly, for different values of $\alpha$, MST and TSP need to
be recomputed since the distance between two points is the $p$-th power of their Euclidean distance
where $p = d(1 - \alpha)$. However, the $k$-NN graph does not change for different values of $p$, since
the $p$-th power is a monotone transformation, and hence the estimates for multiple values of $\alpha$ can be
calculated without the extra penalty incurred by the recomputation of the graph. This can be
advantageous e.g. in intrinsic dimension estimators of manifolds (Costa and Hero, 2003), where $p$ is a free
parameter, and thus one can calculate the estimates efficiently for a few different parameter values.

The fourth major contribution is a proof of a finite-sample high-probability error bound (i.e. the rate
of convergence) for our mutual information estimator which holds under the assumption that the
copula of $f$ is Lipschitz. To the best of our knowledge, this is the first result that gives a
rate for the estimation of Rényi mutual information.

The toolkit for proving our results derives from the deep literature of Euclidean functionals, see
(Steele, 1997; Yukich, 1998). In particular, our strong consistency result uses a theorem due to
Redmond and Yukich (1996) that essentially states that any quasi-additive power-weighted Euclidean
functional can be used as a strongly consistent estimator of Rényi entropy (see also Hero and Michel,
1999). We also make use of a result due to Koo and Lee (2007), who proved a rate of convergence
result that holds under more stringent conditions. Thus, the main thrust of the present work is showing
that these conditions hold for $p$-power weighted nearest-neighbor graphs. Curiously enough, up
to now, no one has shown this, except for the case when $p = 1$, which is studied in Section 8.3 of
(Yukich, 1998). However, the condition $p = 1$ gives results only for $\alpha = 1 - 1/d$.

1 Our result for Rényi entropy estimation holds for $d = 1$ and $d = 2$, too.



All proofs and supporting lemmas can be found in the appendix. In the main body of the paper,
we focus on a clear explanation of the Rényi entropy and mutual information estimation problems, the
estimation algorithms and the statements of our convergence results.

Additionally, we report on two numerical experiments. In the first experiment, we compare the 
empirical rates of convergence of our estimators with our theoretical results and plug-in estimates. 
Empirically, the NN methods are the clear winner. The second experiment is an illustrative applica- 
tion of mutual information estimation to an Independent Subspace Analysis (ISA) task. 

The paper is organized as follows: In the next section, we formally define Rényi entropy and Rényi
mutual information and the problem of their estimation. Section 3 explains the 'generalized nearest
neighbor' graphs. This graph is then used in Section 4 to define our Rényi entropy estimator. In
the same section, we state a theorem containing our convergence results for this estimator (strong
consistency and rates). In Section 5 we explain the copula transformation, which connects Rényi
entropy with Rényi mutual information. The copula transformation together with the Rényi entropy
estimator from Section 4 is used to build an estimator of Rényi mutual information. We conclude
this section with a theorem stating the convergence properties of the estimator (strong consistency
and rates). Section 6 contains the numerical experiments. We conclude the paper with a detailed
discussion of further related work in Section 7 and a list of open problems and directions for future
research in Section 8.



2 The Formal Definition of the Problem 

Rényi entropy and Rényi mutual information of $d$ real-valued random variables² $\mathbf{X} =
(X^1, X^2, \dots, X^d)$ with joint density $f : \mathbb{R}^d \to \mathbb{R}$ and marginal densities $f_i : \mathbb{R} \to \mathbb{R}$, $1 \le i \le d$,
are defined for any real parameter $\alpha$ assuming the underlying integrals exist. For $\alpha \ne 1$, Rényi
entropy and Rényi mutual information are defined respectively as³

$$H_\alpha(\mathbf{X}) = H_\alpha(f) = \frac{1}{1-\alpha} \log \int_{\mathbb{R}^d} f^\alpha(x^1, x^2, \dots, x^d) \, d(x^1, x^2, \dots, x^d) , \qquad (1)$$

$$I_\alpha(\mathbf{X}) = I_\alpha(f) = \frac{1}{\alpha-1} \log \int_{\mathbb{R}^d} f^\alpha(x^1, x^2, \dots, x^d) \left( \prod_{i=1}^d f_i(x^i) \right)^{1-\alpha} d(x^1, x^2, \dots, x^d) . \qquad (2)$$

For $\alpha = 1$ they are defined by the limits $H_1 = \lim_{\alpha \to 1} H_\alpha$ and $I_1 = \lim_{\alpha \to 1} I_\alpha$. In fact, Shannon
(differential) entropy and the Shannon mutual information are just special cases of Rényi entropy
and Rényi mutual information with $\alpha = 1$.
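
As a sanity check on definition (1), the following sketch evaluates the Rényi entropy of a one-dimensional Gaussian by numerical integration and compares it with the closed form $\frac{1}{2}\log(2\pi\sigma^2) + \frac{\log \alpha}{2(\alpha - 1)}$, which follows from (1) by a direct Gaussian integral. It is an illustration only and not part of the paper's method: the function names, the midpoint rule and the integration range are assumptions made for this example.

```python
import math

def renyi_entropy_numeric(density, alpha, lo, hi, steps=200_000):
    """Approximate H_alpha(f) = log(integral of f^alpha) / (1 - alpha) in 1D
    with the midpoint rule on [lo, hi] (illustrative helper, alpha != 1)."""
    h = (hi - lo) / steps
    total = sum(density(lo + (i + 0.5) * h) ** alpha for i in range(steps)) * h
    return math.log(total) / (1.0 - alpha)

def gaussian_pdf(x, sigma=1.0):
    """Density of N(0, sigma^2)."""
    return math.exp(-x * x / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))

def renyi_entropy_gaussian(alpha, sigma=1.0):
    """Closed form of H_alpha for N(0, sigma^2), valid for alpha > 0, alpha != 1."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + math.log(alpha) / (2 * (alpha - 1))

alpha = 0.7
numeric = renyi_entropy_numeric(gaussian_pdf, alpha, -40.0, 40.0)
exact = renyi_entropy_gaussian(alpha)
# Shannon differential entropy of N(0, 1), the alpha -> 1 limit.
shannon = 0.5 * math.log(2 * math.pi * math.e)
print(numeric, exact, renyi_entropy_gaussian(0.999999), shannon)
```

Evaluating the closed form at $\alpha$ close to 1 illustrates the limit statement above: the value approaches the Shannon differential entropy $\frac{1}{2}\log(2\pi e \sigma^2)$.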

The goal of this paper is to present estimators of Rényi entropy (1) and Rényi information (2) and
study their convergence properties. To be more explicit, we consider the problem where we are
given i.i.d. random variables $\mathbf{X}_{1:n} = (\mathbf{X}_1, \mathbf{X}_2, \dots, \mathbf{X}_n)$ where each $\mathbf{X}_j = (X_j^1, X_j^2, \dots, X_j^d)$ has
density $f : \mathbb{R}^d \to \mathbb{R}$ and marginal densities $f_i : \mathbb{R} \to \mathbb{R}$ and our task is to construct an estimate
$\hat{H}_\alpha(\mathbf{X}_{1:n})$ of $H_\alpha(f)$ and an estimate $\hat{I}_\alpha(\mathbf{X}_{1:n})$ of $I_\alpha(f)$ using the sample $\mathbf{X}_{1:n}$.



3 Generalized Nearest-Neighbor Graphs 

The basic tool to define our estimators is the generalized nearest-neighbor graph and more specifically
the sum of the $p$-th powers of Euclidean lengths of its edges.

Formally, let $V$ be a finite set of points in a Euclidean space $\mathbb{R}^d$ and let $S$ be a finite non-empty
set of positive integers; we denote by $k$ the maximum element of $S$. We define the generalized
nearest-neighbor graph $NN_S(V)$ as a directed graph on $V$. The edge set of $NN_S(V)$ contains
for each $i \in S$ an edge from each vertex $\mathbf{x} \in V$ to its $i$-th nearest neighbor. That is, if we sort
$V \setminus \{\mathbf{x}\} = \{\mathbf{y}_1, \mathbf{y}_2, \dots, \mathbf{y}_{|V|-1}\}$ according to the Euclidean distance to $\mathbf{x}$ (breaking ties arbitrarily):
$\|\mathbf{x} - \mathbf{y}_1\| \le \|\mathbf{x} - \mathbf{y}_2\| \le \dots \le \|\mathbf{x} - \mathbf{y}_{|V|-1}\|$ then $\mathbf{y}_i$ is the $i$-th nearest-neighbor of $\mathbf{x}$ and for each
$i \in S$ there is an edge from $\mathbf{x}$ to $\mathbf{y}_i$ in the graph.

2 We use superscript for indexing dimension coordinates.

3 The base of the logarithms in the definition is not important; any base strictly bigger than 1 is allowed.
Similarly as with Shannon entropy and mutual information, one traditionally uses either base 2 or $e$. In this
paper, for definiteness, we stick to base $e$.

For $p \ge 0$ let us denote by $L_p(V)$ the sum of the $p$-th powers of Euclidean lengths of its edges.
Formally,

$$L_p(V) = \sum_{(\mathbf{x}, \mathbf{y}) \in E(NN_S(V))} \|\mathbf{x} - \mathbf{y}\|^p , \qquad (3)$$

where $E(NN_S(V))$ denotes the edge set of $NN_S(V)$. We intentionally hide the dependence on $S$
in the notation $L_p(V)$. For the rest of the paper, the reader should think of $S$ as a fixed but otherwise
arbitrary finite non-empty set of integers, say, $S = \{1, 3, 4\}$.

The following is a basic result about $L_p$. The proof can be found in the appendix.

Theorem 1 (Constant $\gamma$). Let $\mathbf{X}_{1:n} = (\mathbf{X}_1, \mathbf{X}_2, \dots, \mathbf{X}_n)$ be an i.i.d. sample from the uniform
distribution over the $d$-dimensional unit cube $[0, 1]^d$. For any $p \ge 0$ and any finite non-empty set $S$
of positive integers there exists a constant $\gamma > 0$ such that

$$\lim_{n \to \infty} \frac{L_p(\mathbf{X}_{1:n})}{n^{1-p/d}} = \gamma \quad \text{a.s.} \qquad (4)$$

The value of $\gamma$ depends on $d$, $p$, $S$ and, except for special cases, an analytical formula for its value is
not known. This causes a minor problem since the constant $\gamma$ appears in our estimators. A simple
and effective way to deal with this problem is to generate a large i.i.d. sample $\mathbf{X}_{1:n}$ from the uniform
distribution over $[0, 1]^d$ and estimate $\gamma$ by the empirical value of $L_p(\mathbf{X}_{1:n}) / n^{1-p/d}$.
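
The Monte Carlo recipe for $\gamma$ described above can be sketched as follows. This is a minimal illustration with assumed names (`L_p`, `estimate_gamma`) and a deliberately small sample size so that the quadratic brute-force search stays fast; a practical estimate would use a much larger $n$ and possibly an average over several runs.

```python
import math
import random

def L_p(points, S, p):
    """Sum of p-th powers of the edge lengths of NN_S(points) (brute force)."""
    total = 0.0
    for i, x in enumerate(points):
        dists = sorted(math.dist(x, y) for j, y in enumerate(points) if j != i)
        total += sum(dists[r - 1] ** p for r in S)
    return total

def estimate_gamma(n, d, S, p, rng):
    """Empirical value of L_p(X_{1:n}) / n^(1 - p/d) for a uniform sample on [0,1]^d."""
    sample = [tuple(rng.random() for _ in range(d)) for _ in range(n)]
    return L_p(sample, S, p) / n ** (1.0 - p / d)

rng = random.Random(0)
gamma_hat = estimate_gamma(n=400, d=2, S={1}, p=1.0, rng=rng)
print(gamma_hat)
```

For $S = \{1\}$, $p = 1$ and $d = 2$, a heuristic based on a homogeneous Poisson process puts the mean nearest-neighbor distance near $1/(2\sqrt{n})$, so a value in the vicinity of 0.5 is plausible here; boundary effects at this sample size shift it somewhat.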

4 An Estimator of Rényi Entropy

We are now ready to present an estimator of Rényi entropy based on the generalized nearest-neighbor
graph. Suppose we are given an i.i.d. sample $\mathbf{X}_{1:n} = (\mathbf{X}_1, \mathbf{X}_2, \dots, \mathbf{X}_n)$ from a distribution $\mu$ over
$\mathbb{R}^d$ with density $f$. We estimate entropy $H_\alpha(f)$ for $\alpha \in (0, 1)$ by

$$\hat{H}_\alpha(\mathbf{X}_{1:n}) = \frac{1}{1-\alpha} \log \frac{L_p(\mathbf{X}_{1:n})}{\gamma n^{1-p/d}} \quad \text{where } p = d(1 - \alpha), \qquad (5)$$

and $L_p(\cdot)$ is the sum of $p$-th powers of Euclidean lengths of edges of the nearest-neighbor graph
$NN_S(\cdot)$ for some finite non-empty $S \subset \mathbb{N}^+$ as defined by equation (3). The constant $\gamma$ is the same
as in Theorem 1.
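
To make the estimator (5) concrete, the sketch below applies it to a uniform sample on $[0, 2]^2$, whose Rényi entropy equals $\log 4$ for every $\alpha$. It is a hedged illustration rather than the authors' code: the helper names are assumptions, $\gamma$ is approximated from an independent uniform sample on the unit square of the same size, and the brute-force neighbor search limits the sample size.

```python
import math
import random

def L_p(points, S, p):
    """Sum of p-th powers of the edge lengths of NN_S(points) (brute force)."""
    total = 0.0
    for i, x in enumerate(points):
        dists = sorted(math.dist(x, y) for j, y in enumerate(points) if j != i)
        total += sum(dists[r - 1] ** p for r in S)
    return total

def uniform_sample(n, d, side, rng):
    """n i.i.d. points from the uniform distribution on [0, side]^d."""
    return [tuple(side * rng.random() for _ in range(d)) for _ in range(n)]

def renyi_entropy_estimate(sample, alpha, S, gamma):
    """Estimator (5): log(L_p / (gamma * n^(1 - p/d))) / (1 - alpha), p = d(1 - alpha)."""
    n, d = len(sample), len(sample[0])
    p = d * (1.0 - alpha)
    return math.log(L_p(sample, S, p) / (gamma * n ** (1.0 - p / d))) / (1.0 - alpha)

rng = random.Random(1)
n, d, alpha, S = 500, 2, 0.5, {1, 2, 3}
p = d * (1.0 - alpha)
# gamma is approximated from an independent uniform sample on the unit cube.
gamma = L_p(uniform_sample(n, d, 1.0, rng), S, p) / n ** (1.0 - p / d)
# Uniform density on [0, 2]^2: the true Renyi entropy is log(4) for every alpha.
estimate = renyi_entropy_estimate(uniform_sample(n, d, 2.0, rng), alpha, S, gamma)
print(estimate, math.log(4.0))
```

Because $L_p$ scales as $t^p$ under dilation by $t$, the estimate for a dilated uniform sample differs from $d \log t$ only through the sampling noise of the two $L_p$ values.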

The following theorem is our main result about the estimator $\hat{H}_\alpha$. It states that $\hat{H}_\alpha$ is strongly
consistent and gives upper bounds on the rate of convergence. The proof of the theorem is in the appendix.

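Theorem 2 (Consistency and Rate for $\hat{H}_\alpha$). Let $\alpha \in (0, 1)$. Let $\mu$ be an absolutely continuous
distribution over $\mathbb{R}^d$ with bounded support and let $f$ be its density. If $\mathbf{X}_{1:n} = (\mathbf{X}_1, \mathbf{X}_2, \dots, \mathbf{X}_n)$ is
an i.i.d. sample from $\mu$ then

$$\lim_{n \to \infty} \hat{H}_\alpha(\mathbf{X}_{1:n}) = H_\alpha(f) \quad \text{a.s.} \qquad (6)$$

Moreover, if $f$ is Lipschitz then for any $\delta > 0$ with probability at least $1 - \delta$,

$$\left| \hat{H}_\alpha(\mathbf{X}_{1:n}) - H_\alpha(f) \right| \le \begin{cases} O\left( n^{-\frac{d-p}{d(2d-p)}} (\log(1/\delta))^{1/2 - p/(2d)} \right) , & \text{if } 0 < p < d-1 ; \\ O\left( n^{-\frac{d-p}{d(d+1)}} (\log(1/\delta))^{1/2 - p/(2d)} \right) , & \text{if } d-1 \le p < d . \end{cases} \qquad (7)$$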



5 Copulas and Estimator of Mutual Information 

Estimating mutual information is slightly more complicated than estimating entropy. We start with a
basic property of mutual information which we call rescaling. It states that if $h_1, h_2, \dots, h_d : \mathbb{R} \to
\mathbb{R}$ are arbitrary strictly increasing functions, then

$$I_\alpha(h_1(X^1), h_2(X^2), \dots, h_d(X^d)) = I_\alpha(X^1, X^2, \dots, X^d) . \qquad (8)$$

A particularly clever choice is $h_j = F_j$ for all $1 \le j \le d$, where $F_j$ is the cumulative distribution
function (c.d.f.) of $X^j$. With this choice, the marginal distribution of $h_j(X^j)$ is the uniform
distribution over $[0, 1]$ assuming that $F_j$, the c.d.f. of $X^j$, is continuous. Looking at the definition of $H_\alpha$
and $I_\alpha$ we see that

$$I_\alpha(X^1, X^2, \dots, X^d) = I_\alpha(F_1(X^1), F_2(X^2), \dots, F_d(X^d)) = -H_\alpha(F_1(X^1), F_2(X^2), \dots, F_d(X^d)) .$$

In other words, calculation of mutual information can be reduced to the calculation of entropy
provided that the marginal c.d.f.'s $F_1, F_2, \dots, F_d$ are known. The problem is, of course, that these are not
known and need to be estimated from the sample. We will use empirical c.d.f.'s $(\hat{F}_1, \hat{F}_2, \dots, \hat{F}_d)$
as their estimates. Given an i.i.d. sample $\mathbf{X}_{1:n} = (\mathbf{X}_1, \mathbf{X}_2, \dots, \mathbf{X}_n)$ from distribution $\mu$ with
density $f$, the empirical c.d.f.'s are defined as

$$\hat{F}_j(x) = \frac{1}{n} \left| \{ i : 1 \le i \le n, \ X_i^j \le x \} \right| \quad \text{for } x \in \mathbb{R}, \ 1 \le j \le d .$$

Introduce the compact notation $\mathbf{F} : \mathbb{R}^d \to [0, 1]^d$, $\hat{\mathbf{F}} : \mathbb{R}^d \to [0, 1]^d$,

$$\mathbf{F}(x^1, x^2, \dots, x^d) = (F_1(x^1), F_2(x^2), \dots, F_d(x^d)) \quad \text{for } (x^1, x^2, \dots, x^d) \in \mathbb{R}^d ; \qquad (9)$$

$$\hat{\mathbf{F}}(x^1, x^2, \dots, x^d) = (\hat{F}_1(x^1), \hat{F}_2(x^2), \dots, \hat{F}_d(x^d)) \quad \text{for } (x^1, x^2, \dots, x^d) \in \mathbb{R}^d . \qquad (10)$$


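Let us call the maps $\mathbf{F}$, $\hat{\mathbf{F}}$ the copula transformation and the empirical copula transformation,
respectively. The joint distribution of $\mathbf{F}(\mathbf{X}) = (F_1(X^1), F_2(X^2), \dots, F_d(X^d))$ is called the copula
of $\mu$, and the sample $(\hat{\mathbf{Z}}_1, \hat{\mathbf{Z}}_2, \dots, \hat{\mathbf{Z}}_n) = (\hat{\mathbf{F}}(\mathbf{X}_1), \hat{\mathbf{F}}(\mathbf{X}_2), \dots, \hat{\mathbf{F}}(\mathbf{X}_n))$ is called the empirical
copula (Dedecker et al., 2007). Note that the $j$-th coordinate of $\hat{\mathbf{Z}}_i$ equals

$$\hat{Z}_i^j = \frac{1}{n} \operatorname{rank}(X_i^j, \{X_1^j, X_2^j, \dots, X_n^j\}) ,$$

where $\operatorname{rank}(x, A)$ is the number of elements of $A$ less than or equal to $x$. Also, observe
that the random variables $\hat{\mathbf{Z}}_1, \hat{\mathbf{Z}}_2, \dots, \hat{\mathbf{Z}}_n$ are not even independent! Nonetheless, the empirical
copula $(\hat{\mathbf{Z}}_1, \hat{\mathbf{Z}}_2, \dots, \hat{\mathbf{Z}}_n)$ is a good approximation of an i.i.d. sample $(\mathbf{Z}_1, \mathbf{Z}_2, \dots, \mathbf{Z}_n) =
(\mathbf{F}(\mathbf{X}_1), \mathbf{F}(\mathbf{X}_2), \dots, \mathbf{F}(\mathbf{X}_n))$ from the copula of $\mu$. Hence, we estimate the Rényi mutual
information $I_\alpha$ by

$$\hat{I}_\alpha(\mathbf{X}_{1:n}) = -\hat{H}_\alpha(\hat{\mathbf{Z}}_1, \hat{\mathbf{Z}}_2, \dots, \hat{\mathbf{Z}}_n) , \qquad (11)$$

where $\hat{H}_\alpha$ is defined by (5). The following theorem is our main result about the estimator $\hat{I}_\alpha$. It
states that $\hat{I}_\alpha$ is strongly consistent and gives upper bounds on the rate of convergence. The proof of
this theorem can be found in the appendix.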

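Theorem 3 (Consistency and Rate for $\hat{I}_\alpha$). Let $d \ge 3$ and $\alpha = 1 - p/d \in (1/2, 1)$. Let $\mu$ be an
absolutely continuous distribution over $\mathbb{R}^d$ with density $f$. If $\mathbf{X}_{1:n} = (\mathbf{X}_1, \mathbf{X}_2, \dots, \mathbf{X}_n)$ is an i.i.d.
sample from $\mu$ then

$$\lim_{n \to \infty} \hat{I}_\alpha(\mathbf{X}_{1:n}) = I_\alpha(f) \quad \text{a.s.}$$

Moreover, if the density of the copula of $\mu$ is Lipschitz, then for any $\delta > 0$ with probability at least
$1 - \delta$,

$$\left| \hat{I}_\alpha(\mathbf{X}_{1:n}) - I_\alpha(f) \right| \le \begin{cases} O\left( \max\left\{ n^{-\frac{d-p}{d(2d-p)}} , n^{-p/2 + p/d} \right\} (\log(1/\delta))^{1/2} \right) , & \text{if } 0 < p \le 1 ; \\ O\left( \max\left\{ n^{-\frac{d-p}{d(2d-p)}} , n^{-1/2 + p/d} \right\} (\log(1/\delta))^{1/2} \right) , & \text{if } 1 \le p \le d-1 ; \\ O\left( \max\left\{ n^{-\frac{d-p}{d(d+1)}} , n^{-1/2 + p/d} \right\} (\log(1/\delta))^{1/2} \right) , & \text{if } d-1 \le p < d . \end{cases}$$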
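
The construction of this section (rank-transform every coordinate, then negate the entropy estimate of the transformed sample) can be sketched as below. The helper names, the sample sizes and the test distributions are assumptions made for illustration, $\gamma$ is approximated from a uniform sample of the same size, and the rank computation assumes no ties, which holds with probability one for continuous data.

```python
import math
import random

def L_p(points, S, p):
    """Sum of p-th powers of the edge lengths of NN_S(points) (brute force)."""
    total = 0.0
    for i, x in enumerate(points):
        dists = sorted(math.dist(x, y) for j, y in enumerate(points) if j != i)
        total += sum(dists[r - 1] ** p for r in S)
    return total

def empirical_copula(sample):
    """Replace each coordinate by rank / n (rank = position in sorted order,
    which equals the number of sample values <= it when there are no ties)."""
    n, d = len(sample), len(sample[0])
    Z = [[0.0] * d for _ in range(n)]
    for j in range(d):
        order = sorted(range(n), key=lambda i: sample[i][j])
        for rank, i in enumerate(order, start=1):
            Z[i][j] = rank / n
    return Z

def renyi_mi_estimate(sample, alpha, S, gamma):
    """Estimator (11): minus the entropy estimate (5) of the empirical copula."""
    n, d = len(sample), len(sample[0])
    p = d * (1.0 - alpha)
    Z = empirical_copula(sample)
    return -math.log(L_p(Z, S, p) / (gamma * n ** (1.0 - p / d))) / (1.0 - alpha)

rng = random.Random(2)
n, d, alpha, S = 300, 3, 0.7, {1, 2, 3}
p = d * (1.0 - alpha)
unit = [tuple(rng.random() for _ in range(d)) for _ in range(n)]
gamma = L_p(unit, S, p) / n ** (1.0 - p / d)

independent = [tuple(rng.gauss(0, 1) for _ in range(d)) for _ in range(n)]
dependent = []
for _ in range(n):
    z = rng.gauss(0, 1)  # shared component makes the coordinates dependent
    dependent.append(tuple(z + 0.05 * rng.gauss(0, 1) for _ in range(d)))
# Strictly increasing rescaling of every coordinate, as in property (8).
rescaled = [tuple(math.exp(v) for v in x) for x in dependent]

mi_independent = renyi_mi_estimate(independent, alpha, S, gamma)
mi_dependent = renyi_mi_estimate(dependent, alpha, S, gamma)
mi_rescaled = renyi_mi_estimate(rescaled, alpha, S, gamma)
print(mi_independent, mi_dependent, mi_rescaled)
```

Since strictly increasing maps leave the ranks unchanged, the estimate inherits the rescaling property (8) exactly: `mi_rescaled` coincides with `mi_dependent`.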



6 Experiments 

In this section we show two numerical experiments to support our theoretical results about the
convergence rates, and to demonstrate the applicability of the proposed Rényi mutual information
estimator, $\hat{I}_\alpha$.
6.1 The Rate of Convergence

In our first experiment (Fig. 1), we demonstrate that the derived rate is indeed an upper bound on
the convergence rate. Figures 1a-1c show the estimation error of $\hat{I}_\alpha$ as a function of the sample
size. Here, the underlying distribution was a 3D uniform, a 3D Gaussian, and a 20D Gaussian with
randomly chosen nontrivial covariance matrices, respectively. In these experiments $\alpha$ was set to 0.7.
For the estimation we used $S = \{3\}$ (kth) and $S = \{1, 2, 3\}$ (knn) sets. Our results also indicate that
these estimators achieve better performances than the histogram based plug-in estimators (hist). The
number and the sizes of the bins were determined with the rule of Scott (1979). The histogram based
estimator is not shown in the 20D case, as in this large dimension it is not applicable in practice. The
figures are based on averaging 25 independent runs, and they also show the theoretical upper bound
(Theoretical) on the rate derived in Theorem 3. It can be seen that the theoretical rates are rather
conservative. We think that this is because the theory allows for quite irregular densities, while the
densities considered in this experiment are very nice.
(a) 3D uniform (b) 3D Gaussian (c) 20D Gaussian

Figure 1: Error of the estimated Rényi informations as a function of the number of samples.



6.2 Application to Independent Subspace Analysis 

An important application of dependence estimators is the Independent Subspace Analysis problem
(Cardoso, 1998). This problem is a generalization of the Independent Component Analysis (ICA),
where we assume the independent sources are multidimensional vector valued random variables.
The formal description of the problem is as follows. We have $\mathbf{S} = (\mathbf{S}^1; \dots; \mathbf{S}^m) \in \mathbb{R}^{dm}$, $m$
independent $d$-dimensional sources, i.e. $\mathbf{S}^i \in \mathbb{R}^d$, and $I(\mathbf{S}^1, \dots, \mathbf{S}^m) = 0$.⁴ In the ISA statistical model
we assume that $\mathbf{S}$ is hidden, and only $n$ i.i.d. samples from $\mathbf{X} = A\mathbf{S}$ are available for observation,
where $A \in \mathbb{R}^{q \times dm}$ is an unknown invertible matrix with full rank and $q \ge dm$. Based on $n$ i.i.d.
observations of $\mathbf{X}$, our task is to estimate the hidden sources $\mathbf{S}^i$ and the mixing matrix $A$. Let the
estimation of $\mathbf{S}$ be denoted by $\mathbf{Y} = (\mathbf{Y}^1; \dots; \mathbf{Y}^m) \in \mathbb{R}^{dm}$, where $\mathbf{Y} = W\mathbf{X}$. The goal of ISA is
to calculate $\arg\min_W I(\mathbf{Y}^1, \dots, \mathbf{Y}^m)$, where $W \in \mathbb{R}^{dm \times q}$ is a matrix of full rank. Following the
ideas of Cardoso (1998), this ISA problem can be solved by first preprocessing the observed
quantities $\mathbf{X}$ by a traditional ICA algorithm which provides us with an estimated separation matrix $W_{ICA}$⁵, and
then simply grouping the estimated ICA components into ISA subspaces by maximizing the sum of
the MI in the estimated subspaces, that is we have to find a permutation matrix $P \in \{0, 1\}^{dm \times dm}$
which solves

$$\max_P \sum_{j=1}^m I(Y_1^j, Y_2^j, \dots, Y_d^j) , \qquad (12)$$

where $\mathbf{Y} = P W_{ICA} \mathbf{X}$. We used the proposed copula based information estimation, $\hat{I}_\alpha$ with
$\alpha = 0.99$ to approximate the Shannon mutual information, and we chose $S = \{1, 2, 3\}$. Our
experiment shows that this ISA algorithm using the proposed MI estimator can indeed provide good
estimation of the ISA subspaces. We used a standard ISA benchmark dataset from Szabó et al.
(2007); we generated 2,000 i.i.d. sample points on 3D geometric wireframe distributions from 6
different sources independently from each other. These sampled points can be seen in Fig. 2a and
they represent the sources, $\mathbf{S}$. Then we mixed these sources by a randomly chosen invertible matrix
$A \in \mathbb{R}^{18 \times 18}$. The six 3-dimensional projections of the observed quantities $\mathbf{X} = A\mathbf{S}$ are shown in
Fig. 2b. Our task was to estimate the original sources $\mathbf{S}$ using the sample of the observed quantity
$\mathbf{X}$ only. By estimating the MI in (12), we could recover the original subspaces as can be seen
in Fig. 2c. The successful subspace separation is shown in the form of a Hinton diagram as well,
which displays the product of the estimated ISA separation matrix $W = P W_{ICA}$ and $A$. It is a block
permutation matrix if and only if the subspace separation is perfect (Fig. 2d).

4 Here we need the generalization of MI to multidimensional quantities, but that is obvious by simply
replacing the 1D marginals by $d$-dimensional ones.

5 For simplicity we used the FastICA algorithm in our experiments (Hyvärinen et al., 2001).



(a) Original (b) Mixed (c) Estimated (d) Hinton

Figure 2: ISA experiment for six 3-dimensional sources.



7 Further Related Works 



As it was pointed out earlier, in this paper we heavily built on the results known from the theory of
Euclidean functionals (Steele, 1997; Redmond and Yukich, 1996; Koo and Lee, 2007). However,
now we can be more precise about earlier work concerning nearest-neighbor based Euclidean
functionals: The closest to our work is Section 8.3 of Yukich (1998), where the case of $NN_S$ graph
based $p$-power weighted Euclidean functionals with $S = \{1, 2, \dots, k\}$ and $p = 1$ was investigated.



Nearest-neighbor graphs have first been proposed for Shannon entropy estimation by Kozachenko
and Leonenko (1987). In particular, in the mentioned work only the case of $NN_S$ graphs with
$S = \{1\}$ was considered. More recently, Goria et al. (2005) generalized this approach to $S = \{k\}$
and proved the resulting estimator's weak consistency under some conditions on the density. The
estimator in this paper has a form quite similar to that of ours:

$$\tilde{H}_1 = \log(n - 1) - \psi(k) + \log\left( \frac{2 \pi^{d/2}}{d \, \Gamma(d/2)} \right) + \frac{d}{n} \sum_{i=1}^n \log \|\mathbf{e}_i\| .$$

Here $\psi$ stands for the digamma function, and $\mathbf{e}_i$ is the directed edge pointing from $\mathbf{X}_i$ to its $k$-th
nearest-neighbor. Comparing this with (5), unsurprisingly, we find that the main difference is the
use of the logarithm function instead of $\|\cdot\|^p$ and the different normalization. As mentioned before,
Leonenko et al. (2008) proposed an estimator that uses the $NN_S$ graph with $S = \{k\}$ for the purpose
of estimating the Rényi entropy. Their estimator takes the form

$$\tilde{H}_\alpha = \frac{1}{1-\alpha} \log\left( \frac{n-1}{n} V_d^{1-\alpha} C_k^{1-\alpha} \sum_{i=1}^n \frac{\|\mathbf{e}_i\|^{d(1-\alpha)}}{(n-1)^\alpha} \right) ,$$

where $\Gamma$ stands for the Gamma function, $C_k = \left[ \frac{\Gamma(k)}{\Gamma(k+1-\alpha)} \right]^{1/(1-\alpha)}$, $V_d = \pi^{d/2} / \Gamma(d/2 + 1)$
is the volume of the $d$-dimensional unit ball, and again $\mathbf{e}_i$ is the directed edge in the $NN_S$ graph
starting from node $\mathbf{X}_i$ and pointing to the $k$-th nearest node. Comparing this estimator with (5),
it is apparent that it is (essentially) a special case of our $NN_S$ based estimator. From the results
of Leonenko et al. (2008) it is obvious that the constant $\gamma$ in (5) can be found in analytical form
when $S = \{k\}$. However, we kindly warn the reader again that the proofs of these last three cited
articles (Kozachenko and Leonenko, 1987; Goria et al., 2005; Leonenko et al., 2008) contain a few
errors, just like the Wang et al. (2009b) paper for KL divergence estimation from two samples.
Kraskov et al. (2004) also proposed a $k$-nearest-neighbors based estimator for Shannon mutual
information estimation, but the theoretical properties of their estimator are unknown.






8 Conclusions and Open Problems 



We have studied Rényi entropy and mutual information estimators based on $NN_S$ graphs. The
estimators were shown to be strongly consistent. In addition, we derived upper bounds on their
convergence rate under some technical conditions. Several open problems remain unanswered:

An important open problem is to understand how the choice of the set $S \subset \mathbb{N}^+$ affects our estimators.
Perhaps, there exists a way to choose $S$ as a function of the sample size $n$ (and $d$, $p$) which strikes
the optimal balance between the bias and the variance of our estimators.

Our method can be used for estimation of Shannon entropy and mutual information by simply using
$\alpha$ close to 1. The open problem is to come up with a way of choosing $\alpha$, approaching 1, as a
function of the sample size $n$ (and $d$, $p$) such that the resulting estimator is consistent and converges
as rapidly as possible. An alternative is to use the logarithm function in place of the power function.
However, the theory would need to be changed significantly to show that the resulting estimator
remains strongly consistent.

In the proof of consistency of our mutual information estimator $\hat{I}_\alpha$ we used the Kiefer-Dvoretzky-Wolfowitz
theorem to handle the effect of the inaccuracy of the empirical copula transformation.
Our particular use of the theorem seems to restrict $\alpha$ to the interval $(1/2, 1)$ and the dimension
to values larger than 2. Is there a better way to estimate the error caused by the empirical copula
transformation and prove consistency of the estimator for a larger range of $\alpha$'s and $d = 1, 2$?

Finally, it is an important open problem to prove bounds on convergence rates for densities that have
higher order smoothness (i.e. $\beta$-Hölder smooth densities). A related open problem, in the context of
the theory of Euclidean functionals, is stated in Koo and Lee (2007).
A Quasi-Additive and Very Strong Euclidean Functionals



The basic tool to prove convergence properties of our estimators is the theory of quasi-additive
Euclidean functionals developed by Yukich (1998); Steele (1997); Redmond and Yukich (1996);
Koo and Lee (2007) and others. We apply this machinery to the nearest neighbor functional $L_p$
defined in equation (3).

In particular, we use the axiomatic definition of a quasi-additive Euclidean functional from Yukich
(1998) and the definition of a very strong Euclidean functional from Koo and Lee (2007) who add
two extra axioms. We then use the results of Redmond and Yukich (1996) and Koo and Lee (2007)
which hold for these kinds of functionals. These results determine the limit behavior of the
functionals on a set of points chosen i.i.d. from an absolutely continuous distribution over $\mathbb{R}^d$. As we
show in the following sections, the nearest neighbor functional $L_p$ defined by equation (3) is a very
strong Euclidean functional and thus both of these results apply to it.

Technically, a quasi-additive Euclidean functional is a pair of real non-negative functionals
$(L_p(V), L_p^*(V, B))$ where $B \subset \mathbb{R}^d$ is a $d$-dimensional cube and $V \subset B$ is a finite set of points.
Here, a $d$-dimensional cube is a set of the form $\prod_{i=1}^d [a^i, a^i + s]$ where $(a^1, a^2, \dots, a^d) \in \mathbb{R}^d$ is its
"lower-left" corner and $s > 0$ is its side. The functional $L_p^*$ is called the boundary functional. The
common practice is to neglect $L_p^*$ and refer to the pair $(L_p(V), L_p^*(V, B))$ simply as $L_p$. We provide
a boundary functional $L_p^*$ for the nearest neighbor functional $L_p$ in the next section.

Definition 4 (Quasi-additive Euclidean functional). $L_p$ is a quasi-additive Euclidean functional of
power $p$ if it satisfies axioms (A1)-(A7) below.

Definition 5 (Very strong Euclidean functional). $L_p$ is a very strong Euclidean functional of power
$p$ if it satisfies axioms (A1)-(A9) below.

Axioms. For all cubes $B \subseteq \mathbb{R}^d$, any finite $V \subseteq B$, all $\mathbf{y} \in \mathbb{R}^d$, all $t > 0$,

$$L_p(\emptyset) = 0 ; \qquad L_p^*(\emptyset, B) = 0 ; \qquad \text{(A1)}$$

$$L_p(\mathbf{y} + V) = L_p(V) ; \qquad L_p^*(\mathbf{y} + V, \mathbf{y} + B) = L_p^*(V, B) ; \qquad \text{(A2)}$$

$$L_p(tV) = t^p L_p(V) ; \qquad L_p^*(tV, tB) = t^p L_p^*(V, B) ; \qquad \text{(A3)}$$

$$L_p(V) \ge L_p^*(V, B) . \qquad \text{(A4)}$$

For all $V \subseteq [0, 1]^d$ and a partition $\{Q_i : 1 \le i \le m^d\}$ of $[0, 1]^d$ into $m^d$ subcubes of side $1/m$

$$L_p(V) \le \sum_{i=1}^{m^d} L_p(V \cap Q_i) + O(m^{d-p}) ; \qquad L_p^*(V, [0, 1]^d) \ge \sum_{i=1}^{m^d} L_p^*(V \cap Q_i, [0, 1]^d) - O(m^{d-p}) . \qquad \text{(A5)}$$

For all finite $V, V' \subseteq [0, 1]^d$,

$$|L_p(V') - L_p(V)| \le O(|V' \triangle V|^{1-p/d}) ; \qquad |L_p^*(V', [0, 1]^d) - L_p^*(V, [0, 1]^d)| \le O(|V' \triangle V|^{1-p/d}) . \qquad \text{(A6)}$$

For a set $U_n$ of $n$ points drawn i.i.d. from the uniform distribution over $[0, 1]^d$,

$$|E L_p(U_n) - E L_p^*(U_n, [0, 1]^d)| \le o(n^{1-p/d}) ; \qquad \text{(A7)}$$

$$|E L_p(U_n) - E L_p^*(U_n, [0, 1]^d)| \le O(\max(n^{1-p/d-1/d}, 1)) ; \qquad \text{(A8)}$$

$$|E L_p(U_n) - E L_p(U_{n+1})| \le O(n^{-p/d}) . \qquad \text{(A9)}$$

Axiom (A2) is translation invariance, axiom (A3) is scaling. The first part of (A5) is subadditivity of
$L_p$ and the second part is super-additivity of $L_p^*$. Axiom (A6) is smoothness and we call (A7)
quasi-additivity. Axiom (A8) is a strengthening of (A7) with an explicit rate. Axiom (A9) is the add-one
bound. The axioms in Koo and Lee (2007) are slightly different; however, it is routine to check that
they are implied by our set of axioms.
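
The translation invariance (A2) and scaling (A3) axioms are easy to check numerically for the nearest neighbor functional $L_p$ itself. The sketch below does so on a small random point set; it is an illustrative check with assumed names and parameters, and it covers only the $L_p$ halves of the two axioms, not the boundary functional $L_p^*$.

```python
import math
import random

def L_p(points, S, p):
    """Sum of p-th powers of the edge lengths of NN_S(points) (brute force)."""
    total = 0.0
    for i, x in enumerate(points):
        dists = sorted(math.dist(x, y) for j, y in enumerate(points) if j != i)
        total += sum(dists[r - 1] ** p for r in S)
    return total

rng = random.Random(3)
V = [(rng.random(), rng.random()) for _ in range(60)]
S, p, t, shift = {1, 3}, 0.8, 2.5, (0.3, -1.2)

base = L_p(V, S, p)
# (A2): translating every point by the same vector leaves all distances unchanged.
translated = L_p([(x + shift[0], y + shift[1]) for x, y in V], S, p)
# (A3): scaling every point by t multiplies every edge length by t.
scaled = L_p([(t * x, t * y) for x, y in V], S, p)
print(base, translated, scaled, t ** p * base)
```

Both identities hold because translation and dilation preserve the ordering of pairwise distances, so the edge set of $NN_S$ is unchanged and only the edge lengths are rescaled.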

We will use two fundamental results about Euclidean functionals. The first is (Redmond and Yukich,
1996, Theorem 2.2) and the second is essentially (Koo and Lee, 2007, Theorem 4).
Theorem 6 (Redmond-Yukich). Let $L_p$ be a quasi-additive Euclidean functional of power $0 < p < d$.
Let $V_n$ consist of $n$ points drawn i.i.d. from an absolutely continuous distribution over $[0, 1]^d$ with
common probability density function $f : [0, 1]^d \to \mathbb{R}$. Then,

$$\lim_{n \to \infty} \frac{L_p(V_n)}{n^{1-p/d}} = \gamma \int_{[0,1]^d} f^{1-p/d}(\mathbf{x}) \, d\mathbf{x} \quad \text{a.s.} ,$$

where $\gamma := \gamma(L_p, d)$ is a constant depending only on the functional $L_p$ and $d$.

Theorem 7 (Koo-Lee). Let $L_p$ be a very strong Euclidean functional of power $0 < p < d$. Let $V_n$
consist of $n$ points drawn i.i.d. from an absolutely continuous distribution over $[0, 1]^d$ with common probability
density function $f : [0, 1]^d \to \mathbb{R}$. If $f$ is Lipschitz, then

$$\left| \frac{E L_p(V_n)}{n^{1-p/d}} - \gamma \int_{[0,1]^d} f^{1-p/d}(\mathbf{x}) \, d\mathbf{x} \right| \le \begin{cases} O\left( n^{-\frac{d-p}{d(2d-p)}} \right) , & \text{if } 0 < p < d-1 ; \\ O\left( n^{-\frac{d-p}{d(d+1)}} \right) , & \text{if } d-1 \le p < d , \end{cases}$$

where $\gamma$ is the constant from Theorem 6.



Theorem 7 differs from its original statement (Koo and Lee, 2007, Theorem 4) in two ways. First, our version is restricted to Lipschitz densities. Koo and Lee prove a generalization of Theorem 7 for $\beta$-Hölder smooth density functions. The coefficient $\beta$ then appears in the exponent of $n$ in the rate. However, their result holds only for $\beta$ in the interval $(0, 1]$, which makes the generalization less interesting. The case $\beta = 1$ corresponds to Lipschitz densities and is perhaps the most important in this range. Second, Theorem 7 has a slightly improved rate: Koo and Lee have an extraneous $\log(n)$ factor, which we remove by "correcting" their axiom (A8).

In the next section, we prove that the nearest-neighbor functional $L_p$ is a very strong Euclidean functional. First, in Section B, we provide a boundary functional $L_p^*$ for $L_p$. Then, in Section C, we verify that $(L_p, L_p^*)$ satisfy axioms (A1)-(A9). Once the verification is done, Theorem 1 follows from Theorem 6.

Theorem 2 will follow from Theorem 7 and a concentration result. We prove the concentration result in Section D and finish that section with the proof of Theorem 2. The proof of Theorem 3 requires more work, since we need to deal with the effect of the empirical copula transformation. We handle this in Section E by employing the classical Kiefer-Dvoretzky-Wolfowitz theorem.



B The Boundary Functional $L_p^*$

We start by constructing the nearest-neighbor boundary functional $L_p^*$. For that we will need to introduce an auxiliary graph, which we call the nearest-neighbor graph with boundary. This graph is related to $NN_S$ and will be useful later.

Let $B$ be a $d$-dimensional cube, $V \subset B$ be finite, and $S \subset \mathbb{N}^+$ be non-empty and finite. We define the nearest-neighbor graph with boundary $NN_S^*(V, B)$ to be a directed graph, with possibly parallel edges, on the vertex set $V \cup \partial B$, where $\partial B$ denotes the boundary of $B$. Roughly speaking, for every vertex $x \in V$ and every $i \in S$ there is an edge to its "$i$-th nearest neighbor" in $V \cup \partial B$.

More precisely, we define the edges from $x \in V$ as follows. Let $b \in \partial B$ be the boundary point closest to $x$. (If there are multiple boundary points that are closest to $x$, we choose one arbitrarily.) If $(x,y) \in E(NN_S(V))$ and $\|x - y\| \le \|x - b\|$, then $(x,y)$ also belongs to $E(NN_S^*(V,B))$. For each $(x,y) \in E(NN_S(V))$ such that $\|x - y\| > \|x - b\|$ we create in $NN_S^*(V,B)$ one copy of the edge $(x, b)$. In other words, there is a bijection between the edge sets $E(NN_S(V))$ and $E(NN_S^*(V,B))$. An example of a graph $NN_S(V)$ and a corresponding graph $NN_S^*(V,B)$ is shown in Figure 3.

Analogously, we define $L_p^*(V, B)$ as the sum of $p$-th powers of the edge lengths of $NN_S^*(V, B)$. Formally,

$$L_p^*(V,B) = \sum_{(x,y) \in E(NN_S^*(V,B))} \|x - y\|^p . \qquad (13)$$

6 Recall that a function $f$ is Lipschitz if there exists a constant $C > 0$ such that $|f(x) - f(y)| \le C\|x - y\|$ for all $x, y$ in the domain of $f$.

Figure 3: Panel (a) shows an example of a nearest-neighbor graph $NN_S(V)$ in two dimensions, and panel (b) shows the corresponding boundary nearest-neighbor graph $NN_S^*(V, B)$. We have used $S = \{1\}$, $B = [0,1]^2$ and a set $V$ consisting of 13 points in $B$.



We will need some basic geometric properties of $NN_S^*(V, B)$ and $L_p^*(V, B)$. By construction, the edges of $NN_S^*(V, B)$ are no longer than the corresponding edges of $NN_S(V)$. As an immediate consequence we get the following proposition.

Proposition 8 (Upper Bound). For any cube $B$, any $p \ge 0$ and any finite set $V \subset B$, $L_p^*(V, B) \le L_p(V)$.
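As a rough numerical illustration of Proposition 8, the following Python sketch builds both graphs by brute force and compares the two functionals. It is only a sketch under stated assumptions: the function names are hypothetical, the cube is fixed to $B = [0,1]^d$, and an $s$-th neighbor that does not exist (when the set has at most $s$ points) is simply skipped.

```python
import math
import random


def random_points(n, d, seed=0):
    """n i.i.d. uniform points in [0, 1]^d (helper for experiments)."""
    rng = random.Random(seed)
    return [tuple(rng.random() for _ in range(d)) for _ in range(n)]


def knn_edges(points, S):
    """Edges (i, j, dist) of NN_S: j is the s-th nearest neighbor of i, s in S."""
    edges = []
    for i, x in enumerate(points):
        dists = sorted((math.dist(x, y), j) for j, y in enumerate(points) if j != i)
        for s in S:
            if s <= len(dists):  # assumption: skip neighbors that do not exist
                dist, j = dists[s - 1]
                edges.append((i, j, dist))
    return edges


def dist_to_boundary(x):
    """Distance from x to the boundary of the unit cube [0, 1]^d."""
    return min(min(c, 1.0 - c) for c in x)


def L_p(points, S, p):
    """Sum of p-th powers of the edge lengths of NN_S(V)."""
    return sum(dist ** p for _, _, dist in knn_edges(points, S))


def L_p_star(points, S, p):
    """Same sum for NN_S^*(V, B): an edge longer than the distance to the
    boundary is replaced by an edge to the closest boundary point."""
    return sum(
        min(dist, dist_to_boundary(points[i])) ** p
        for i, _, dist in knn_edges(points, S)
    )
```

For two points $(0.1, 0.5)$ and $(0.9, 0.5)$ with $S = \{1\}$ and $p = 1$, both edges of $NN_S(V)$ have length $0.8$, while both edges of $NN_S^*(V,B)$ are redirected to the boundary and have length $0.1$.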



C Verification of Axioms (A1)-(A9) for $(L_p, L_p^*)$

It is easy to see that the nearest-neighbor functional $L_p$ and its boundary functional $L_p^*$ satisfy axioms (A1)-(A3). Axiom (A4) is verified by Proposition 8. It thus remains to verify axioms (A5)-(A9), which we do in subsections C.1, C.2 and C.3. We start with two simple lemmas.

Lemma 9 (In-Degree). For any finite $V \subset \mathbb{R}^d$ the in-degree of any vertex in $NN_S(V)$ is $O(1)$.



Proof. Fix a vertex $x \in V$. We show that the in-degree of $x$ is bounded by some constant that depends only on $d$ and $k = \max S$. For any unit vector $u \in \mathbb{R}^d$ we consider the convex open cone $Q(x, u)$ with apex at $x$, rotationally symmetric about its axis $u$, with half-angle $30°$:

$$Q(x,u) = \left\{ y \in \mathbb{R}^d : u \cdot (y - x) > \tfrac{\sqrt{3}}{2} \|y - x\| \right\} .$$

As is well known, $\mathbb{R}^d \setminus \{x\}$ can be written as a union of finitely many, possibly overlapping, cones $Q(x, u_1), Q(x, u_2), \dots, Q(x, u_B)$, where $B$ depends only on the dimension $d$. We show that the in-degree of $x$ is at most $kB$.

Suppose, by contradiction, that the in-degree of $x$ is larger than $kB$. Then, by the pigeonhole principle, there is a cone $Q(x, u)$ containing $k + 1$ vertices of the graph with an edge to $x$. Denote these vertices $y_1, y_2, \dots, y_{k+1}$ and assume that they are indexed so that $\|x - y_1\| \le \|x - y_2\| \le \dots \le \|x - y_{k+1}\|$.

By a simple calculation, we can verify that $\|x - y_{k+1}\| > \|y_i - y_{k+1}\|$ for all $1 \le i \le k$. Indeed, by the law of cosines,

$$\|y_i - y_{k+1}\|^2 = \|x - y_i\|^2 + \|x - y_{k+1}\|^2 - 2 (x - y_i) \cdot (x - y_{k+1})$$
$$< \|x - y_i\|^2 + \|x - y_{k+1}\|^2 - \|x - y_i\| \, \|x - y_{k+1}\| \le \|x - y_{k+1}\|^2 ,$$

where the strict inequality holds because $y_{k+1}, y_i \in Q(x, u)$ and so the angle between the vectors $(x - y_i)$ and $(x - y_{k+1})$ is strictly less than $60°$, and the second inequality follows from $\|x - y_i\| \le \|x - y_{k+1}\|$. Thus, $x$ cannot be among the $k$ nearest neighbors of $y_{k+1}$, which contradicts the existence of the edge $(y_{k+1}, x)$. □
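The bound of Lemma 9 can be checked empirically. The sketch below (hypothetical helper names, brute-force neighbor search, $S = \{1, \dots, k\}$) counts how many points list a given vertex among their $k$ nearest neighbors. In the plane, seven open sectors of full angle $60°$ cover $\mathbb{R}^2 \setminus \{x\}$, so the argument of the proof gives an in-degree of at most $7k$ when there are no ties; the constant 7 is our own choice for $d = 2$, not a constant stated in the text.

```python
import math
import random
from collections import Counter


def random_points(n, d, seed=0):
    """n i.i.d. uniform points in [0, 1]^d (helper for experiments)."""
    rng = random.Random(seed)
    return [tuple(rng.random() for _ in range(d)) for _ in range(n)]


def max_in_degree(points, k):
    """Largest number of points that have one fixed vertex among their
    k nearest neighbors (in-degree in NN_S with S = {1, ..., k})."""
    indeg = Counter()
    for i, x in enumerate(points):
        nearest = sorted(
            (math.dist(x, y), j) for j, y in enumerate(points) if j != i
        )[:k]
        indeg.update(j for _, j in nearest)
    return max(indeg.values())
```

Since the in-degrees sum to $kn$, the maximum is always at least $k$; the lemma says it also stays below a constant multiple of $k$ regardless of $n$.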

Lemma 10 (Growth Bound). For any $p \ge 0$ and finite $V \subset [0,1]^d$, $L_p(V) \le O(\max(|V|^{1-p/d}, 1))$.

Proof. An elegant way to prove the lemma is with the use of space-filling curves.$^7$ Since Peano (1890) and Hilbert (1891), it has been known that there exists a continuous function $\psi$ from the unit interval $[0, 1]$ onto the cube $[0,1]^d$ (i.e. a surjection). For obvious reasons $\psi$ is called a space-filling curve. Moreover, there are space-filling curves which are $(1/d)$-Hölder; see Milne (1980). In other words, we can assume that there exists a constant $C > 0$ such that

$$\|\psi(x) - \psi(y)\| \le C |x - y|^{1/d} \qquad \forall x, y \in [0, 1] . \qquad (14)$$

Since $\psi$ is a surjective function we can consider a right inverse $\psi^{-1} : [0,1]^d \to [0, 1]$, i.e. a function such that $\psi(\psi^{-1}(x)) = x$, and we let $W = \psi^{-1}(V)$. Let $0 \le w_1 < w_2 < \dots < w_{|V|} \le 1$ be the points of $W$ sorted in increasing order. We construct a "nearest neighbor" graph $G$ on $W$. For every $1 \le j \le |V|$ and every $i \in S$ we create a directed edge $(w_j, w_{j+i})$, where the addition $j + i$ is taken modulo $|V|$. It is not hard to see that the total length of the edges of $G$ is

$$\sum_{(x,y) \in E(G)} |x - y| \le O(k^2) = O(1) . \qquad (15)$$

To see more clearly why (15) holds, note that every line segment $[w_i, w_{i+1}]$, $1 \le i < |V|$, belongs to at most $O(k^2)$ edges and the total length of the line segments is $\sum_{i=1}^{|V|-1} (w_{i+1} - w_i) \le 1$.
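The one-dimensional bound (15) is easy to probe numerically. The sketch below uses hypothetical names and assumes $k = \max S < |W|$. For each $i \in S$ the non-wrapping edges cover every gap $[w_l, w_{l+1}]$ at most $i$ times and there are at most $i$ wrapping edges of length at most 1, which gives the explicit bound $k(k+1)$ used in the comment; this explicit constant is our own bookkeeping, the text only claims $O(k^2)$.

```python
import random


def sample_unit_interval(n, seed=0):
    """n i.i.d. uniform points in [0, 1] (stand-in for W = psi^{-1}(V))."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


def curve_graph_length(w, S):
    """Total length of the edges (w_j, w_{j+i mod n}), i in S, of the graph G
    built on the sorted points w_1 <= ... <= w_n.

    Assuming max(S) < n, the total is at most k (k + 1) with k = max(S),
    independently of n.
    """
    w = sorted(w)
    n = len(w)
    return sum(abs(w[(j + i) % n] - w[j]) for j in range(n) for i in S)
```

The point of the check is that the total does not grow with the number of points, only with $k$.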

Let $H$ be a graph on $V \subset [0,1]^d$ isomorphic to $G$, where for each edge $(w_i, w_j) \in E(G)$ there is a corresponding edge $(\psi(w_i), \psi(w_j)) \in E(H)$. By the construction of $H$,

$$L_p(V) \le \sum_{(x,y) \in E(H)} \|x - y\|^p = \sum_{(x,y) \in E(G)} \|\psi(x) - \psi(y)\|^p . \qquad (16)$$

The Hölder property of $\psi$ implies that

$$\sum_{(x,y) \in E(G)} \|\psi(x) - \psi(y)\|^p \le C^p \sum_{(x,y) \in E(G)} |x - y|^{p/d} . \qquad (17)$$

If $p \ge d$ then $|x - y|^{p/d} \le |x - y|$ since $|x - y| \in [0, 1]$, and thus

$$\sum_{(x,y) \in E(G)} |x - y|^{p/d} \le \sum_{(x,y) \in E(G)} |x - y| .$$

Chaining the last inequality with (16), (17) and (15) we obtain that $L_p(V) \le O(1)$ for $p \ge d$.

If $0 < p < d$ we use the inequality between the arithmetic mean and the $(p/d)$-mean. It states that for positive numbers $a_1, a_2, \dots, a_n$,

$$\left( \frac{\sum_{i=1}^n a_i^{p/d}}{n} \right)^{d/p} \le \frac{\sum_{i=1}^n a_i}{n} \qquad \text{or equivalently} \qquad \sum_{i=1}^n a_i^{p/d} \le n^{1-p/d} \left( \sum_{i=1}^n a_i \right)^{p/d} .$$

In our case the $a_i$'s are the edge lengths of $G$ and $n \le k|V|$, and we have

$$\sum_{(x,y) \in E(G)} |x - y|^{p/d} \le (k|V|)^{1-p/d} \left( \sum_{(x,y) \in E(G)} |x - y| \right)^{p/d} .$$

Combining the last inequality with (16), (17) and (15) we get that $L_p(V) \le O(|V|^{1-p/d})$ for $0 < p < d$.

Finally, for $p = 0$, $L_p(V) \le k|V| = O(|V|)$. □
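The power-mean step used for $0 < p < d$ can be sanity-checked numerically. The following minimal sketch (hypothetical function names; $r$ plays the role of $p/d$) evaluates both sides of $\sum_i a_i^{r} \le n^{1-r} (\sum_i a_i)^{r}$ for $0 < r \le 1$.

```python
import random


def random_lengths(n, seed=0):
    """n positive numbers in (0, 1], standing in for edge lengths of G."""
    rng = random.Random(seed)
    return [1.0 - rng.random() for _ in range(n)]


def power_mean_sides(a, r):
    """Return (lhs, rhs) of sum(a_i^r) <= n^(1-r) * (sum a_i)^r."""
    n = len(a)
    lhs = sum(x ** r for x in a)
    rhs = n ** (1 - r) * sum(a) ** r
    return lhs, rhs


def power_mean_holds(a, r, rel_tol=1e-12):
    """True if the inequality holds up to floating-point rounding."""
    lhs, rhs = power_mean_sides(a, r)
    return lhs <= rhs * (1 + rel_tol)
```

Equality holds when all $a_i$ are equal, which is why the growth bound cannot be improved by this argument alone.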

7 There is an elementary proof, too, based on a discretization argument. However, this proof introduces an extraneous logarithmic factor when $p = d$.

C.1 Smoothness



In this section, we verify axiom (A6).

Lemma 11 (Smoothness of $L_p$). For $p \ge 0$ and finite disjoint $V, V' \subset [0,1]^d$,

$$|L_p(V' \cup V) - L_p(V')| \le O(\max(|V|^{1-p/d}, 1)) .$$

Proof. For $p \ge d$ the lemma trivially follows from the growth bound: $L_p(V') = O(1)$ and $L_p(V' \cup V) = O(1)$. For $0 \le p < d$, we need to prove two inequalities:

$$L_p(V' \cup V) \le L_p(V') + O(|V|^{1-p/d}) \qquad \text{and} \qquad L_p(V') \le L_p(V' \cup V) + O(|V|^{1-p/d}) .$$

We start with the first inequality. We use the obvious property of $L_p$ that $L_p(V' \cup V) \le L_p(V') + L_p(V) + O(1)$. Combined with the growth bound (Lemma 10) for $V$ we get

$$L_p(V' \cup V) \le L_p(V') + L_p(V) + O(1) \le L_p(V') + O(|V|^{1-p/d}) + O(1) \le L_p(V') + O(|V|^{1-p/d}) .$$

The second inequality is a bit trickier to prove. We introduce a generalized nearest-neighbor graph $NN_S(W, W')$ for any pair of finite sets $W, W'$ such that $W \subseteq W' \subset \mathbb{R}^d$. We define $NN_S(W, W')$ as the subgraph of $NN_S(W')$ where all edges from $W' \setminus W$ are deleted. Similarly, we define $L_p(W, W')$ as the sum of $p$-th powers of the edge lengths of $NN_S(W, W')$:

$$L_p(W, W') = \sum_{(x,y) \in E(NN_S(W, W'))} \|x - y\|^p .$$

We will use two obvious properties of $L_p(W, W')$ valid for any finite $W \subseteq W' \subset \mathbb{R}^d$:

$$L_p(W, W) = L_p(W) \qquad \text{and} \qquad L_p(W, W') \le L_p(W) + O(1) . \qquad (18)$$

Let $U \subseteq V'$ be the set of vertices $x$ such that in $NN_S(V' \cup V)$ there exists an edge from $x$ to a vertex in $V$. Using the two observations and the growth bound we have

$$L_p(V') = L_p(V', V') = L_p(U, V') + L_p(V' \setminus U, V') \le L_p(U) + O(1) + L_p(V' \setminus U, V')$$
$$\le O(|U|^{1-p/d}) + L_p(V' \setminus U, V') .$$

The term $L_p(V' \setminus U, V')$ can be upper bounded by $L_p(V' \cup V)$ since, by the choice of $U$, the graph $NN_S(V' \setminus U, V')$ is a subgraph of $NN_S(V' \cup V)$. The term $O(|U|^{1-p/d})$ is at most $O(|V|^{1-p/d})$ since $|U|$ is upper bounded by the number of edges of $NN_S(V' \cup V)$ ending in $V$ and, in turn, the number of these edges is at most $O(|V|)$ by the in-degree lemma. □

Corollary 12 (Smoothness of $L_p$). For $p \ge 0$ and finite $V, V' \subset [0,1]^d$,

$$|L_p(V) - L_p(V')| \le O(\max(|V' \triangle V|^{1-p/d}, 1)) ,$$

where $V' \triangle V$ denotes the symmetric difference.

Proof. Applying the previous lemma twice,

$$|L_p(V) - L_p(V')| \le |L_p(V) - L_p(V \cup V')| + |L_p(V \cup V') - L_p(V')|$$
$$= |L_p(V) - L_p(V \cup (V' \setminus V))| + |L_p(V' \cup (V \setminus V')) - L_p(V')|$$
$$\le O(\max(|V' \setminus V|^{1-p/d}, 1)) + O(\max(|V \setminus V'|^{1-p/d}, 1))$$
$$= O(\max(|V' \triangle V|^{1-p/d}, 1)) . \qquad □$$



Lemma 13 (Smoothness of $L_p^*$). For $p \ge 0$ and finite disjoint $V, V' \subset [0,1]^d$,

$$|L_p^*(V' \cup V, [0,1]^d) - L_p^*(V', [0,1]^d)| \le O(\max(|V|^{1-p/d}, 1)) .$$

Proof. The proof of the lemma is identical to the proof of Lemma 11 if we replace $L_p(\cdot)$ by $L_p^*(\cdot, [0,1]^d)$, $NN_S(\cdot)$ by $NN_S^*(\cdot, [0,1]^d)$, $L_p(\cdot, \cdot)$ by $L_p^*(\cdot, \cdot, [0,1]^d)$ and $NN_S(\cdot, \cdot)$ by $NN_S^*(\cdot, \cdot, [0,1]^d)$. We, of course, need to explain what $NN_S^*(V, W, [0,1]^d)$ and $L_p^*(V, W, [0,1]^d)$ mean. For $V \subseteq W$, we define $NN_S^*(V, W, [0,1]^d)$ as the subgraph of $NN_S^*(W, [0,1]^d)$ where the edges starting in $W \setminus V$ are removed, and $L_p^*(V, W, [0,1]^d)$ is the sum of the $p$-th powers of the Euclidean lengths of the edges of $NN_S^*(V, W, [0,1]^d)$. □

Corollary 14 (Smoothness of $L_p^*$). For $p \ge 0$ and finite $V, V' \subset [0,1]^d$,

$$|L_p^*(V, [0,1]^d) - L_p^*(V', [0,1]^d)| \le O(\max(|V' \triangle V|^{1-p/d}, 1)) ,$$

where $V' \triangle V$ denotes the symmetric difference.

Proof. The corollary is proved in exactly the same way as Corollary 12, where $L_p(\cdot)$ is replaced by $L_p^*(\cdot, [0,1]^d)$. □



C.2 Subadditivity and Superadditivity

In this section, we verify axiom (A5).

Lemma 15 (Subadditivity). Let $p \ge 0$. For $m \in \mathbb{N}^+$ consider the partition $\{Q_i : 1 \le i \le m^d\}$ of the cube $[0,1]^d$ into $m^d$ disjoint subcubes$^8$ of side $1/m$. For any finite $V \subset [0,1]^d$,

$$L_p(V) \le \sum_{i=1}^{m^d} L_p(V \cap Q_i) + O(\max(m^{d-p}, 1)) . \qquad (19)$$

Proof. Consider a subcube $Q_i$ which contains at least $k+1$ points. Using the "$L_p(W, W')$ notation" from the proof of Lemma 11,

$$L_p(V \cap Q_i, V) \le L_p(V \cap Q_i, V \cap Q_i) = L_p(V \cap Q_i) .$$

Let $R$ be the union of the subcubes that contain at most $k$ points. Clearly $|V \cap R| \le k m^d$. Then

$$L_p(V) = L_p(V, V) = L_p(V \cap R, V) + \sum_{\substack{1 \le i \le m^d \\ |V \cap Q_i| \ge k+1}} L_p(V \cap Q_i, V) \le L_p(V \cap R) + O(1) + \sum_{i=1}^{m^d} L_p(V \cap Q_i) ,$$

where we have used the second part of (18). The proof is finished by applying the growth bound $L_p(V \cap R) \le O(\max(|V \cap R|^{1-p/d}, 1)) \le O(\max(m^{d-p}, 1))$. □



Lemma 16 (Superadditivity of $L_p^*$). Let $p \ge 0$. For $m \in \mathbb{N}^+$ consider a partition $\{Q_i : 1 \le i \le m^d\}$ of $[0,1]^d$ into $m^d$ disjoint subcubes of side $1/m$. For any finite $V \subset [0,1]^d$,

$$\sum_{i=1}^{m^d} L_p^*(V \cap Q_i, Q_i) \le L_p^*(V, [0,1]^d) .$$

Proof. We construct a new graph $G$ by modifying the graph $NN_S^*(V, [0,1]^d)$. Consider any edge $(x, y)$ such that $x \in Q_i$ and $y \notin Q_i$ for some $1 \le i \le m^d$. Let $z$ be the point where $\partial Q_i$ and the line segment from $x$ to $y$ intersect. In $G$, we replace $(x, y)$ by $(x, z)$. Note that all edges of $G$ lie completely in one of the subcubes $Q_i$ and they are shorter than or equal to the corresponding edges in $NN_S^*(V, [0,1]^d)$.

Let $L_{i,p}$ be the sum of $p$-th powers of the Euclidean lengths of the edges of $G$ lying in $Q_i$. Since edges in $G$ are no longer than in $NN_S^*(V, [0,1]^d)$, $\sum_{i=1}^{m^d} L_{i,p} \le L_p^*(V, [0,1]^d)$. To finish the proof it remains to show that $L_p^*(V \cap Q_i, Q_i) \le L_{i,p}$ for all $1 \le i \le m^d$.

For any edge $(x, z)$ in $G$ from $x \in V \cap Q_i$ to $z \in \partial Q_i$, the point $z \in \partial Q_i$ is not necessarily the boundary point closest to $x$. Therefore, any edge in $NN_S^*(V \cap Q_i, Q_i)$ is no longer than the corresponding edge in $G$. □

8 In order for the subcubes to be pairwise disjoint, most of them need to be semi-open and some of them closed.



C.3 Uniformly Distributed Points

Axiom (A7) is a direct consequence of axiom (A8). Hence, we are left with verifying axioms (A8) and (A9). In this section, $U_n$ denotes a set of $n$ points chosen independently and uniformly at random from $[0,1]^d$.

Lemma 17 (Average Edge Length). Assume $X_1, X_2, \dots, X_n$ are chosen i.i.d. uniformly at random from $[0,1]^d$. Let $k$ be a fixed positive integer. Let $Z$ be the distance from $X_1$ to its $k$-th nearest neighbor in $\{X_2, X_3, \dots, X_n\}$. For any $p \ge 0$,

$$\mathbb{E}[Z^p \mid X_1] \le O(n^{-p/d}) .$$



Proof. We denote by $B(x, r) = \{y \in \mathbb{R}^d : \|x - y\| \le r\}$ the ball of radius $r \ge 0$ centered at a point $x \in \mathbb{R}^d$. Since $Z$ lies in the interval $[0, \sqrt{d}]$ and is non-negative,

$$\mathbb{E}[Z^p \mid X_1] = \int_0^\infty \Pr[Z^p > t \mid X_1] \, dt$$
$$= p \int_0^{\sqrt{d}} u^{p-1} \Pr[Z > u \mid X_1] \, du$$
$$= p \int_0^{\sqrt{d}} u^{p-1} \Pr\big[ |\{X_2, X_3, \dots, X_n\} \cap B(X_1, u)| < k \mid X_1 \big] \, du$$
$$= p \int_0^{\sqrt{d}} u^{p-1} \sum_{j=0}^{k-1} \binom{n-1}{j} \big[\mathrm{Vol}(B(X_1, u) \cap [0,1]^d)\big]^j \big[1 - \mathrm{Vol}(B(X_1, u) \cap [0,1]^d)\big]^{n-1-j} \, du$$
$$\le p \int_0^{2\sqrt{d}} \sum_{j=0}^{k-1} \binom{n-1}{j} u^{p-1} \big[\mathrm{Vol}(B(X_1, u))\big]^j \left[1 - \left(\frac{u}{2\sqrt{d}}\right)^d\right]^{n-1-j} du .$$

The last inequality follows from the obvious bound $\mathrm{Vol}(B(X_1, u) \cap [0,1]^d) \le \mathrm{Vol}(B(X_1, u))$ and from the fact that for $u \in [0, \sqrt{d}]$ the intersection $B(X_1, u) \cap [0,1]^d$ contains a cube of side at least $\frac{u}{2\sqrt{d}}$.

To simplify this complicated integral, we note that $\mathrm{Vol}(B(X_1, u)) = \mathrm{Vol}(B(X_1, 1)) \, u^d$ and make the substitution $s = \left(\frac{u}{2\sqrt{d}}\right)^d$. The last integral can be bounded by a constant multiple of

$$\sum_{j=0}^{k-1} \binom{n-1}{j} \int_0^1 s^{p/d + j - 1} (1 - s)^{n-1-j} \, ds .$$

Since $\binom{n-1}{j} = O(n^j)$ and the sum consists of only a constant number of terms, it remains to show that the inner integral is $O(n^{-p/d - j})$. We can express the inner integral using the gamma function. Then, we use the asymptotic relation $\binom{n}{c} = \Theta(n^c)$ for generalized binomial coefficients $\binom{a}{b} = \frac{\Gamma(a+1)}{\Gamma(b+1)\Gamma(a-b+1)}$ to upper-bound the result:

$$\int_0^1 s^{p/d + j - 1} (1 - s)^{n-1-j} \, ds = \frac{\Gamma(p/d + j)\,\Gamma(n - j)}{\Gamma(n + p/d)} = \frac{1}{(p/d + j) \binom{n + p/d - 1}{p/d + j}} = O(n^{-p/d - j}) . \qquad □$$
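The gamma-function identity at the end of the proof is easy to check numerically. The sketch below (hypothetical function names, standard-library `math` only) evaluates the closed form $\Gamma(p/d+j)\Gamma(n-j)/\Gamma(n+p/d)$ through log-gamma and rescales it by $n^{p/d+j}$. If the inner integral is $O(n^{-p/d-j})$, the rescaled value should stay bounded; in fact it tends to $\Gamma(p/d + j)$.

```python
import math


def beta_integral(a, b):
    """Closed form of the integral of s^(a-1) (1-s)^(b-1) over [0, 1],
    computed via log-gamma to avoid overflow for large b."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))


def scaled_inner_integral(n, p, d, j):
    """n^(p/d + j) times the inner integral of the proof, i.e. times
    Gamma(p/d + j) Gamma(n - j) / Gamma(n + p/d)."""
    a = p / d + j
    return n ** a * beta_integral(a, n - j)
```

For example, `scaled_inner_integral(10**6, 1, 3, 2)` is within a fraction of a percent of `math.gamma(1/3 + 2)`.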



Lemma 18 (Add-One Bound). For any $p \ge 0$, $|\mathbb{E}[L_p(U_n)] - \mathbb{E}[L_p(U_{n+1})]| \le O(n^{-p/d})$.

Proof. Let $X_1, X_2, \dots, X_n, X_{n+1}$ be i.i.d. points from the uniform distribution over $[0,1]^d$. We couple $U_n$ and $U_{n+1}$ in the obvious way: $U_n = \{X_1, X_2, \dots, X_n\}$ and $U_{n+1} = \{X_1, X_2, \dots, X_{n+1}\}$. Let $Z$ be the distance from $X_{n+1}$ to its $k$-th closest neighbor in $U_n$. The inequality

$$L_p(U_{n+1}) \le L_p(U_n) + |S| Z^p$$

holds since $|S| Z^p$ accounts for the edges from $X_{n+1}$ and since the edges from $U_n$ are shorter (or equal) in $NN_S(U_{n+1})$ than the corresponding edges in $NN_S(U_n)$. Taking expectations and using Lemma 17 we get

$$\mathbb{E}[L_p(U_{n+1})] \le \mathbb{E}[L_p(U_n)] + O(n^{-p/d}) .$$

To show the other direction of the inequality, let $Z_i$ be the distance from $X_i$ to its $(k+1)$-th nearest point in $U_{n+1}$. (Recall that $k = \max S$.) Let $N(j) = \{X_i : (X_i, X_j) \in E(NN_S(U_{n+1}))\}$ be the incoming neighborhood of $X_j$. Now if we remove $X_j$ from $NN_S(U_{n+1})$, the vertices in $N(j)$ lose $X_j$ as their neighbor and they need to be connected to a new neighbor in $U_{n+1} \setminus \{X_j\}$. This neighbor is not farther than their $(k+1)$-th nearest neighbor in $U_{n+1}$. Therefore,

$$L_p(U_{n+1} \setminus \{X_j\}) \le L_p(U_{n+1}) + \sum_{X_i \in N(j)} Z_i^p .$$

Summing over all $j = 1, 2, \dots, n+1$ we have

$$\sum_{j=1}^{n+1} L_p(U_{n+1} \setminus \{X_j\}) \le (n+1) L_p(U_{n+1}) + \sum_{j=1}^{n+1} \sum_{X_i \in N(j)} Z_i^p .$$

The double sum on the right-hand side is simply a sum over all edges of $NN_S(U_{n+1})$ and so we can write

$$\sum_{j=1}^{n+1} L_p(U_{n+1} \setminus \{X_j\}) \le (n+1) L_p(U_{n+1}) + |S| \sum_{i=1}^{n+1} Z_i^p .$$

Taking expectations and using Lemma 17 to bound $\mathbb{E}[Z_i^p]$ we arrive at

$$(n+1)\, \mathbb{E}[L_p(U_n)] \le (n+1)\, \mathbb{E}[L_p(U_{n+1})] + (n+1)\, O(n^{-p/d}) .$$

The proof is finished by dividing through by $(n+1)$. □

Lemma 19 (Quasi-additivity). $0 \le \mathbb{E}[L_p(U_n)] - \mathbb{E}[L_p^*(U_n, [0,1]^d)] \le O(\max(n^{1-p/d-1/d}, 1))$ for any $p > 0$.

Proof. The first inequality follows from Proposition 8 by taking expectation. The proof of the second inequality is much more involved. Consider the (random) subset of points $\widehat{U}_n \subseteq U_n$ which are connected to the boundary in $NN_S^*(U_n, [0,1]^d)$ by at least one edge. We use the notation $L_p(W, W')$ for any $W \subseteq W'$ and its two properties expressed by Eq. (18), and a third obvious property $L_p(W, W') \le L_p(W')$. We have

$$L_p(U_n) = L_p(U_n, U_n) = L_p(\widehat{U}_n, U_n) + L_p(U_n \setminus \widehat{U}_n, U_n) \le L_p(\widehat{U}_n) + O(1) + L_p^*(U_n, [0,1]^d) ,$$

where in the last step we have used that $L_p(U_n \setminus \widehat{U}_n, U_n) \le L_p^*(U_n, [0,1]^d)$, which holds since the edges from the vertices $U_n \setminus \widehat{U}_n$ are the same in both graphs $NN_S(U_n)$ and $NN_S^*(U_n, [0,1]^d)$. If we take expectation, we get

$$\mathbb{E}[L_p(U_n)] - \mathbb{E}[L_p^*(U_n, [0,1]^d)] \le \mathbb{E}[L_p(\widehat{U}_n)] + O(1)$$

and we see that we are left to show that $\mathbb{E}[L_p(\widehat{U}_n)] \le O(\max(n^{1-p/d-1/d}, 1))$. In order to do that, we start by showing that

$$\mathbb{E}[|\widehat{U}_n|] \le O(n^{1-1/d}) . \qquad (20)$$

Consider the cube $B = [n^{-1/d}, 1 - n^{-1/d}]^d$. We bound $\mathbb{E}[|\widehat{U}_n \cap B|]$ and $\mathbb{E}[|\widehat{U}_n \cap ([0,1]^d \setminus B)|]$ separately. The latter is easily bounded by $O(n^{1-1/d})$ since there are $n$ points and the probability that a point lies in $[0,1]^d \setminus B$ is $\mathrm{Vol}([0,1]^d \setminus B) \le O(n^{-1/d})$. We now bound $|\widehat{U}_n \cap B|$. Consider a face $F$ of the cube $[0,1]^d$. Partition $B$ into $m = O(n^{1-1/d})$ rectangles $R_1, R_2, \dots, R_m$ such that the perpendicular projection of any rectangle $R_i$, $1 \le i \le m$, on $F$ has diameter at most $n^{-1/d}$ and its $(d-1)$-dimensional volume is $\Theta(n^{1/d-1})$; see Figure 4. It is not hard to see that, in $U_n \cap R_i$, only the $k$ closest points to $F$ can be connected to $F$ by an edge in $NN_S^*(U_n, [0,1]^d)$. There are $2d$ faces and $m$ rectangles and hence $|\widehat{U}_n \cap B| \le 2dkm = O(n^{1-1/d})$. We have thus proved (20).

Figure 4: The left drawing shows the box $B = [n^{-1/d}, 1 - n^{-1/d}]^d \subset [0,1]^d$ in gray. The right drawing shows a partition of $B$ into rectangles $R_1, R_2, \dots, R_m$. The projection of each rectangle $R_i$ on the right side $F$ has diameter at most $n^{-1/d}$. In each rectangle $R_i$ at most $k$ points are connected to $F$ by an edge.



The second key component that we need is that the expected sum of $p$-th powers of lengths of edges of $NN_S^*(U_n, [0,1]^d)$ that connect points in $\widehat{U}_n$ to $\partial[0,1]^d$ is "small". More precisely, for any point $x \in [0,1]^d$ let $b_x \in \partial[0,1]^d$ be the boundary point closest to $x$. We show that

$$\mathbb{E}\left[ \sum_{X \in \widehat{U}_n} \|X - b_X\|^p \right] \le O(n^{1-p/d-1/d}) . \qquad (21)$$

We decompose the task as

$$\mathbb{E}\left[ \sum_{X \in \widehat{U}_n} \|X - b_X\|^p \right] = \mathbb{E}\left[ \sum_{X \in \widehat{U}_n \cap B} \|X - b_X\|^p \right] + \mathbb{E}\left[ \sum_{X \in \widehat{U}_n \cap ([0,1]^d \setminus B)} \|X - b_X\|^p \right] .$$

Clearly, since every point of $[0,1]^d \setminus B$ is within distance $n^{-1/d}$ of the boundary, the second term is bounded by $n^{-p/d}\, \mathbb{E}[|\widehat{U}_n \cap ([0,1]^d \setminus B)|] = O(n^{1-1/d-p/d})$. To bound the first term, consider a face $F$ of the cube $[0,1]^d$ and a rectangle $R_i$ in the decomposition of $B$ into $R_1, R_2, \dots, R_m$ mentioned above. Let $Z$ be the distance of the $k$-th closest point in $U_n \cap R_i$ to $F$. (If $U_n \cap R_i$ contains fewer than $k$ points, we define $Z$ to be $1 - n^{-1/d}$.) Recall that only the $k$ closest points of $U_n \cap R_i$ can be connected to $F$, and their distance to $F$ is bounded by $Z$. There are $2d$ faces, $m = O(n^{1-1/d})$ rectangles and at most $k$ points in each rectangle connected to a face. If we can show that $\mathbb{E}[Z^p] = O(n^{-p/d})$, we can upper bound the first term by $2dkm \cdot O(n^{-p/d}) = O(n^{1-p/d-1/d})$, from which (21) follows.



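We now prove that $\mathbb{E}[Z^p] = O(n^{-p/d})$. Let $Y = Z - n^{-1/d}$. Since $\mathbb{E}[Z^p] \le 2^p\, \mathbb{E}[Y^p] + 2^p n^{-p/d}$, it suffices to show that $\mathbb{E}[Y^p] = O(n^{-p/d})$. Let $q$ be the $(d-1)$-dimensional volume of the projection of $R_i$ to $F$. Recall that $q = \Theta(n^{1/d-1})$. Since $Y \in [0, 1 - 2n^{-1/d}]$ we have

$$\mathbb{E}[Y^p] = p \int_0^{1 - 2n^{-1/d}} t^{p-1} \Pr[Y > t] \, dt$$
$$= p \int_0^{1 - 2n^{-1/d}} t^{p-1} \sum_{j=0}^{k-1} \binom{n}{j} (qt)^j (1 - qt)^{n-j} \, dt$$
$$\le \frac{p}{q^p} \sum_{j=0}^{k-1} \binom{n}{j} \frac{\Gamma(p + j)\, \Gamma(n - j + 1)}{\Gamma(n + p + 1)} .$$

Each of the $k$ terms of the sum is $O(n^j \cdot n^{-p-j}) = O(n^{-p})$, so $\mathbb{E}[Y^p] = O((qn)^{-p}) = O(n^{-p/d})$.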



We now use (20) and (21) to show that $\mathbb{E}[L_p(\widehat{U}_n)] \le O(\max(n^{1-p/d-1/d}, 1))$, which will finish the proof. For any point $X \in \widehat{U}_n$ consider the point $b_X$ lying on the boundary. Let $\widehat{V}_n = \{b_X : X \in \widehat{U}_n\}$ and let $NN_S(\widehat{V}_n)$ be its nearest-neighbor graph. Since $\widehat{V}_n$ lies in a union of $(d-1)$-dimensional faces, by the growth bound $L_p(\widehat{V}_n) \le O(\max(|\widehat{V}_n|^{1-p/(d-1)}, 1))$. Thus, if $0 < p < d - 1$ we use that $x \mapsto x^{1-p/(d-1)}$ is concave together with (20), and we have

$$\mathbb{E}[L_p(\widehat{V}_n)] \le O\!\left(\mathbb{E}\big[|\widehat{V}_n|^{1-p/(d-1)}\big]\right) \le O\!\left(\mathbb{E}\big[|\widehat{U}_n|^{1-p/(d-1)}\big]\right) \le O\!\left(\mathbb{E}\big[|\widehat{U}_n|\big]^{1-p/(d-1)}\right) = O(n^{1-p/d-1/d}) .$$

If $p \ge d - 1$ then $L_p(\widehat{V}_n) = O(1)$. Therefore, for any $p > 0$,

$$\mathbb{E}[L_p(\widehat{V}_n)] \le O(\max(n^{1-p/d-1/d}, 1)) . \qquad (22)$$

We construct a nearest-neighbor graph $G$ on $\widehat{U}_n$ by lifting $NN_S(\widehat{V}_n)$. For every edge $(b_X, b_Y)$ in $NN_S(\widehat{V}_n)$ we create an edge $(X, Y)$. Clearly, $L_p(\widehat{U}_n)$ is at most the sum of $p$-th powers of the edge lengths of $G$. By the triangle inequality, for any $p > 0$,

$$\|X - Y\|^p \le \left( \|X - b_X\| + \|b_X - b_Y\| + \|b_Y - Y\| \right)^p \le 3^p \left( \|X - b_X\|^p + \|b_X - b_Y\|^p + \|b_Y - Y\|^p \right) .$$

In-degrees and out-degrees of $G$ are $O(1)$, and so if we sum over all edges $(X, Y)$ of $G$ and take expectation, we get

$$\mathbb{E}[L_p(\widehat{U}_n)] \le O\!\left(\mathbb{E}[L_p(\widehat{V}_n)]\right) + O\!\left( \mathbb{E}\left[ \sum_{X \in \widehat{U}_n} \|X - b_X\|^p \right] \right) .$$

To upper bound the right-hand side we use (21) and (22), which proves that $\mathbb{E}[L_p(\widehat{U}_n)] \le O(\max(n^{1-p/d-1/d}, 1))$ and finishes the proof. □
D Concentration and Estimator of Entropy

In this section, we show that if $V_n$ is a set of $n$ points drawn i.i.d. from any distribution over $[0,1]^d$ then $L_p(V_n)$ is tightly concentrated. That is, we show that with high probability $L_p(V_n)$ is within $O(n^{1/2 - p/(2d)})$ of its expected value. We use this result at the end of this section to give a proof of Theorem 2.

It turns out that in order to derive the concentration result, the properties of the distribution generating the points are irrelevant (even the existence of a density is not necessary). The only property that we exploit is the smoothness of $L_p$. As a technical tool, we use the isoperimetric inequality for Hamming distance and product measures. This inequality is, in turn, a simple consequence of Talagrand's isoperimetric inequality; see e.g. Dubhashi and Panconesi (2009); Alon and Spencer (2000); Talagrand (1995). To phrase the isoperimetric inequality, we use the Hamming distance $H(x_{1:n}, y_{1:n})$ between two tuples $x_{1:n} = (x_1, x_2, \dots, x_n)$ and $y_{1:n} = (y_1, y_2, \dots, y_n)$, which is defined as the number of elements in which $x_{1:n}$ and $y_{1:n}$ disagree.

Theorem 20 (Isoperimetric Inequality). Let $A \subseteq \Omega^n$ be a subset of an $n$-fold product of a probability space equipped with a product measure. For any $t \ge 0$ let $A_t = \{x_{1:n} \in \Omega^n : \exists\, y_{1:n} \in A \text{ s.t. } H(x_{1:n}, y_{1:n}) \le t\}$ be an expansion of $A$. Then, for any $t \ge 0$,

$$\Pr[A] \, \Pr[\overline{A_t}] \le \exp\!\left( -\frac{t^2}{4n} \right) ,$$

where $\overline{A_t}$ denotes the complement of $A_t$ with respect to $\Omega^n$.

Theorem 21 (Concentration Around the Median). Let $V_n$ consist of $n$ points drawn i.i.d. from an absolutely continuous probability distribution over $[0,1]^d$, and let $0 \le p < d$. For any $t > 0$,

$$\Pr\big[ |L_p(V_n) - M(L_p(V_n))| > t \big] \le e^{-\Theta(t^{2d/(d-p)}/n)} ,$$

where $M(\cdot)$ denotes the median of a random variable.




Proof. Let $\Omega = [0,1]^d$ and $V_n = \{X_1, X_2, \dots, X_n\}$, where $X_1, X_2, \dots, X_n$ are independent. To emphasize that we are working in a product space, we use the notations $L_p(x_{1:n}) := L_p(\{x_1, x_2, \dots, x_n\})$, $L_p(X_{1:n}) := L_p(V_n) = L_p(\{X_1, X_2, \dots, X_n\})$ and $M := M(L_p(X_{1:n}))$. Let $A = \{x_{1:n} \in \Omega^n : L_p(x_{1:n}) \le M\}$. By smoothness of $L_p$ there exists a constant $C > 0$ such that

$$L_p(x_{1:n}) \le L_p(y_{1:n}) + C \cdot H(x_{1:n}, y_{1:n})^{1-p/d} .$$

Therefore, $L_p(x_{1:n}) > M + t$ implies that $x_{1:n} \in \overline{A}_{(t/C)^{d/(d-p)}}$. Hence for a random $X_{1:n} = (X_1, X_2, \dots, X_n)$,

$$\Pr[L_p(X_{1:n}) > M + t] \le \Pr\big[X_{1:n} \in \overline{A}_{(t/C)^{d/(d-p)}}\big] \le \frac{1}{\Pr[A]}\, e^{-\Theta(t^{2d/(d-p)}/n)}$$

by the isoperimetric inequality. Similarly, we set $B = \overline{A}$ and note that by smoothness we also have the reversed inequality

$$L_p(y_{1:n}) \le L_p(x_{1:n}) + C \cdot H(x_{1:n}, y_{1:n})^{1-p/d} .$$

Therefore, $L_p(x_{1:n}) < M - t$ implies that $x_{1:n} \in \overline{B}_{(t/C)^{d/(d-p)}}$. By the same argument as before,

$$\Pr[L_p(X_{1:n}) < M - t] \le \Pr\big[X_{1:n} \in \overline{B}_{(t/C)^{d/(d-p)}}\big] \le \frac{1}{\Pr[B]}\, e^{-\Theta(t^{2d/(d-p)}/n)} .$$

The theorem follows by the union bound and the fact that $\Pr[A] = \Pr[B] = 1/2$. □

Corollary 22 (Deviation of the Mean and the Median). Let $V_n$ consist of $n$ points drawn i.i.d. from an absolutely continuous probability distribution over $[0,1]^d$, let $0 \le p < d$ and let $S \subset \mathbb{N}^+$ be a finite set. Then

$$|\mathbb{E}[L_p(V_n)] - M(L_p(V_n))| \le O(n^{1/2 - p/(2d)}) .$$

Proof. For conciseness let $L_p = L_p(V_n)$ and $M = M(L_p(V_n))$. We have

$$|\mathbb{E}[L_p] - M| \le \mathbb{E}|L_p - M| = \int_0^\infty \Pr[|L_p - M| > t] \, dt \le \int_0^\infty e^{-\Theta(t^{2d/(d-p)}/n)} \, dt = O(n^{1/2 - p/(2d)}) . \qquad □$$



Putting these pieces together we arrive at what we wanted to prove:

Corollary 23 (Concentration). Let $V_n$ consist of $n$ points drawn i.i.d. from an absolutely continuous probability distribution over $[0,1]^d$, let $0 \le p < d$ and let $S \subset \mathbb{N}^+$ be a finite set. For any $\delta > 0$, with probability at least $1 - \delta$,

$$|\mathbb{E}[L_p(V_n)] - L_p(V_n)| \le O\!\left( (n \log(1/\delta))^{1/2 - p/(2d)} \right) . \qquad (23)$$



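Proof of Theorem 2. By scaling and translation, we can assume that the support of $\mu$ is contained in the unit cube $[0,1]^d$. The first part of the theorem follows immediately from Theorem 6. To prove the second part observe from (23) that for any $\delta > 0$, with probability at least $1 - \delta$,

$$\left| \frac{\mathbb{E}[L_p(V_n)]}{\gamma n^{1-p/d}} - \frac{L_p(V_n)}{\gamma n^{1-p/d}} \right| \le O\!\left( n^{-1/2 + p/(2d)} (\log(1/\delta))^{1/2 - p/(2d)} \right) . \qquad (24)$$

It is easy to see that if $0 < p < d - 1$ then $-1/2 + p/(2d) < -\frac{d-p}{d(2d-p)} < 0$, and if $d - 1 \le p < d$ then $-1/2 + p/(2d) < -\frac{d-p}{d(d+1)} < 0$. Now using (24), Theorem 7 and the triangle inequality, we have that for any $\delta > 0$, with probability at least $1 - \delta$,

$$\left| \frac{L_p(V_n)}{\gamma n^{1-p/d}} - \int_{[0,1]^d} f^{1-p/d}(x)\,dx \right| \le \left| \frac{L_p(V_n)}{\gamma n^{1-p/d}} - \frac{\mathbb{E}[L_p(V_n)]}{\gamma n^{1-p/d}} \right| + \left| \frac{\mathbb{E}[L_p(V_n)]}{\gamma n^{1-p/d}} - \int_{[0,1]^d} f^{1-p/d}(x)\,dx \right|$$
$$\le \begin{cases} O\!\left( n^{-\frac{d-p}{d(2d-p)}} (\log(1/\delta))^{1/2 - p/(2d)} \right), & \text{if } 0 < p < d-1 ; \\ O\!\left( n^{-\frac{d-p}{d(d+1)}} (\log(1/\delta))^{1/2 - p/(2d)} \right), & \text{if } d-1 \le p < d . \end{cases}$$

To finish the proof of (7), exploit the fact that $\log(1 \pm x) = \pm O(x)$ for $x \to 0$. □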

E Copulas and Estimator of Mutual Information

The goal of this section is to prove Theorem 3 on the convergence of the estimator $\hat{I}_\alpha$. The main additional problem that we need to deal with in the proof is the effect of the empirical copula transformation. A version of the classical Kiefer-Dvoretzky-Wolfowitz theorem due to Massart gives a convenient way to do it; see e.g. Devroye and Lugosi (2001).

Theorem 24 (Kiefer-Dvoretzky-Wolfowitz). Let $X_1, X_2, \dots, X_n$ be an i.i.d. sample from a probability distribution over $\mathbb{R}$ with c.d.f. $F : \mathbb{R} \to [0, 1]$. Define the empirical c.d.f.

$$\hat{F}(x) = \frac{1}{n} \left| \{ i : 1 \le i \le n,\ X_i \le x \} \right| \qquad \text{for } x \in \mathbb{R} .$$

Then, for any $t \ge 0$,

$$\Pr\left[ \sup_{x \in \mathbb{R}} |F(x) - \hat{F}(x)| > t \right] \le 2 e^{-2nt^2} .$$

As a simple consequence of the Kiefer-Dvoretzky-Wolfowitz theorem, we can derive that the empirical copula transformation $\hat{\mathbf{F}}$ is a good approximation of the copula $\mathbf{F}$.
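To get a feeling for Theorem 24, the following Monte Carlo sketch compares the observed frequency of large deviations with the bound $2e^{-2nt^2}$. It is only an illustration under stated assumptions: the sample is drawn from the uniform distribution on $[0,1]$ (so $F(x) = x$), the function names are hypothetical, and the frequency is a noisy estimate of the probability.

```python
import math
import random


def sup_deviation_uniform(sample):
    """sup_x |F_hat(x) - F(x)| for a sample from Uniform[0, 1], where F(x) = x.

    The supremum is attained at a jump of the empirical c.d.f., either just
    before a sample point (F_hat = i/n) or at it (F_hat = (i+1)/n).
    """
    xs = sorted(sample)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


def dkw_bound(n, t):
    """Right-hand side of the Kiefer-Dvoretzky-Wolfowitz inequality."""
    return 2.0 * math.exp(-2.0 * n * t * t)


def exceed_frequency(n, t, trials, rng):
    """Fraction of simulated samples of size n whose sup-deviation exceeds t."""
    hits = sum(
        sup_deviation_uniform([rng.random() for _ in range(n)]) > t
        for _ in range(trials)
    )
    return hits / trials
```

A typical use is `exceed_frequency(200, 0.1, 1000, random.Random(0))`, to be compared with `dkw_bound(200, 0.1)`; up to simulation noise the former should not exceed the latter.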
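Lemma 25 (Convergence of Empirical Copula). Let $X_1, X_2, \dots, X_n$ be an i.i.d. sample from a probability distribution over $\mathbb{R}^d$ with marginal c.d.f.'s $F_1, F_2, \dots, F_d$. Let $\mathbf{F}$ be the copula defined by (9) and let $\hat{\mathbf{F}}$ be the empirical copula transformation defined by (10). Then, for any $t \ge 0$,

$$\Pr\left[ \sup_{x \in \mathbb{R}^d} \|\mathbf{F}(x) - \hat{\mathbf{F}}(x)\|_2 > t \right] \le 2d\, e^{-2nt^2/d} .$$

Proof. Using $\|\cdot\|_2 \le \sqrt{d}\, \|\cdot\|_\infty$ in $\mathbb{R}^d$ and the union bound we have

$$\Pr\left[ \sup_{x \in \mathbb{R}^d} \|\mathbf{F}(x) - \hat{\mathbf{F}}(x)\|_2 > t \right] \le \Pr\left[ \sup_{x \in \mathbb{R}^d} \|\mathbf{F}(x) - \hat{\mathbf{F}}(x)\|_\infty > \frac{t}{\sqrt{d}} \right]$$
$$= \Pr\left[ \sup_{x \in \mathbb{R}} \max_{1 \le j \le d} |F_j(x) - \hat{F}_j(x)| > \frac{t}{\sqrt{d}} \right]$$
$$\le \sum_{j=1}^{d} \Pr\left[ \sup_{x \in \mathbb{R}} |F_j(x) - \hat{F}_j(x)| > \frac{t}{\sqrt{d}} \right] \le 2d\, e^{-2nt^2/d} . \qquad □$$

The following corollary is an obvious consequence of this lemma:

Corollary 26 (Convergence of Empirical Copula). Let $X_1, X_2, \dots, X_n$ be an i.i.d. sample from a probability distribution over $\mathbb{R}^d$ with marginal c.d.f.'s $F_1, F_2, \dots, F_d$. Let $\mathbf{F}$ be the copula defined by (9), and let $\hat{\mathbf{F}}$ be the empirical copula transformation defined by (10). Then, for any $\delta > 0$,

$$\Pr\left[ \max_{1 \le i \le n} \|\mathbf{F}(X_i) - \hat{\mathbf{F}}(X_i)\|_2 < \sqrt{\frac{d \log(2d/\delta)}{2n}} \right] \ge 1 - \delta . \qquad (25)$$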



Proposition 27 (Order statistics). Let $a_1, a_2, \dots, a_m$ and $b_1, b_2, \dots, b_m$ be real numbers. Let $a_{(1)} \le a_{(2)} \le \dots \le a_{(m)}$ and $b_{(1)} \le b_{(2)} \le \dots \le b_{(m)}$ be the same numbers sorted in ascending order. Then $|a_{(i)} - b_{(i)}| \le \max_j |a_j - b_j|$ for all $1 \le i \le m$.

Proof. The proof is left as an exercise for the reader. □
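Since the proof of Proposition 27 is left as an exercise, a quick numerical check may be reassuring. The sketch below (hypothetical function names) compares the largest gap between order statistics with the largest gap between the numbers in their original pairing.

```python
import random


def random_pair(m, seed=0):
    """Two lists of m real numbers in [0, 1) (test data only)."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(m)], [rng.random() for _ in range(m)]


def sorted_gap(a, b):
    """max_i |a_(i) - b_(i)| after sorting both lists in ascending order."""
    return max(abs(x - y) for x, y in zip(sorted(a), sorted(b)))


def pairwise_gap(a, b):
    """max_j |a_j - b_j| in the original pairing."""
    return max(abs(x - y) for x, y in zip(a, b))
```

For instance, with $a = (3, 1)$ and $b = (1.5, 2.5)$ the pairwise gap is $1.5$ while the gap between order statistics is only $0.5$: sorting can only bring the two lists closer.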



Lemma 28 (Perturbation). Consider points $x_1, x_2, \dots, x_n, y_1, y_2, \dots, y_n \in \mathbb{R}^d$ such that $\|x_i - y_i\| < \epsilon$ for all $1 \le i \le n$. Then,

$$|L_p(\{x_1, x_2, \dots, x_n\}) - L_p(\{y_1, y_2, \dots, y_n\})| \le \begin{cases} O(n\epsilon^p), & \text{if } 0 < p < 1 ; \\ O(n\epsilon), & \text{if } 1 \le p . \end{cases}$$

Proof. Let $k = \max S$, $A = \{x_1, x_2, \dots, x_n\}$ and $B = \{y_1, y_2, \dots, y_n\}$. Let $w_A(i,j) = \|x_i - x_j\|^p$ and $w_B(i,j) = \|y_i - y_j\|^p$ be the edge weights defined by $A$ and $B$ respectively. Let $a^{i}_{(j)}$ be the $p$-th power of the distance from $x_i$ to its $j$-th nearest neighbor in $A$, for $1 \le i \le n$, $1 \le j \le n - 1$. Similarly, let $b^{i}_{(j)}$ be the $p$-th power of the distance from $y_i$ to its $j$-th nearest neighbor in $B$. Note that for any $i$, if we sort the real numbers $w_A(i,1), \dots, w_A(i, i-1), w_A(i, i+1), \dots, w_A(i,n)$, then we get $a^{i}_{(1)} \le a^{i}_{(2)} \le \dots \le a^{i}_{(n-1)}$. Similarly for the $w_B$'s and $b^{i}_{(j)}$'s. Using this notation we can write

$$|L_p(A) - L_p(B)| = \left| \sum_{i=1}^{n} \sum_{j \in S} \left( a^{i}_{(j)} - b^{i}_{(j)} \right) \right|$$
$$\le \sum_{i=1}^{n} \sum_{j \in S} \left| a^{i}_{(j)} - b^{i}_{(j)} \right|$$
$$\le \sum_{i=1}^{n} \sum_{j \in S} \max_{i', j'} \left| a^{i'}_{(j')} - b^{i'}_{(j')} \right|$$
$$\le \sum_{i=1}^{n} \sum_{j \in S} \max_{i', j'} |w_A(i', j') - w_B(i', j')|$$
$$\le kn \max_{1 \le i, j \le n} |w_A(i,j) - w_B(i,j)| .$$



The third inequality follows from Proposition 27. It remains to bound $|w_A(i,j) - w_B(i,j)|$. We consider two cases:

Case $0 < p < 1$. Using $|u^p - v^p| \le |u - v|^p$, valid for any $u, v \ge 0$, and the triangle inequality

$$\big| \|a - b\| - \|c - d\| \big| \le \|a - c\| + \|b - d\| , \qquad (26)$$

valid for any $a, b, c, d \in \mathbb{R}^d$, we have

$$|w_A(i,j) - w_B(i,j)| = \big| \|x_i - x_j\|^p - \|y_i - y_j\|^p \big|$$
$$\le \big| \|x_i - x_j\| - \|y_i - y_j\| \big|^p$$
$$\le \left( \|x_i - y_i\| + \|x_j - y_j\| \right)^p$$
$$\le 2^p \epsilon^p .$$

Case $p \ge 1$. Consider the function $f(u) = u^p$ on the interval $[0, \sqrt{d}]$. On this interval $|f'(u)| \le p d^{(p-1)/2}$ and so $f$ is Lipschitz with constant $p d^{(p-1)/2}$. In other words, for any $u, v \in [0, \sqrt{d}]$, $|u^p - v^p| \le p d^{(p-1)/2} |u - v|$. Thus

$$|w_A(i,j) - w_B(i,j)| = \big| \|x_i - x_j\|^p - \|y_i - y_j\|^p \big|$$
$$\le p d^{(p-1)/2} \big| \|x_i - x_j\| - \|y_i - y_j\| \big|$$
$$\le p d^{(p-1)/2} \left( \|x_i - y_i\| + \|x_j - y_j\| \right)$$
$$\le 2 \epsilon\, p d^{(p-1)/2} ,$$

where the second inequality follows from (26). □
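The perturbation bound can be tried out directly. The sketch below is a brute-force illustration with hypothetical function names. It assumes $d = 2$ and keeps all points inside $[0,1]^2$ (the case $p \ge 1$ of the proof uses distances in $[0, \sqrt{d}]$), and it uses the explicit constants read off the proof with $|S|$ in place of $k$: $|S|\, n\, 2^p \epsilon^p$ for $p < 1$ and $|S|\, n\, 2\epsilon\, p\, d^{(p-1)/2}$ for $p \ge 1$.

```python
import math
import random


def L_p(points, S, p):
    """Sum over all points of the p-th powers of the distances to the
    s-th nearest neighbors, s in S (brute force, assumes max(S) < len(points))."""
    total = 0.0
    for i, x in enumerate(points):
        dists = sorted(math.dist(x, y) for j, y in enumerate(points) if j != i)
        total += sum(dists[s - 1] ** p for s in S)
    return total


def perturbed_pair(n, eps, seed=0):
    """Points x_i in [0.1, 0.9]^2 and y_i with ||x_i - y_i|| < eps.
    Assumes eps <= 0.1 so that every y_i stays inside [0, 1]^2."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = (0.1 + 0.8 * rng.random(), 0.1 + 0.8 * rng.random())
        r, theta = eps * rng.random(), 2.0 * math.pi * rng.random()
        xs.append(x)
        ys.append((x[0] + r * math.cos(theta), x[1] + r * math.sin(theta)))
    return xs, ys


def perturbation_bound(n, S, p, eps, d=2):
    """Explicit version of the bound in Lemma 28, following the two cases."""
    if p < 1:
        return len(S) * n * (2.0 * eps) ** p
    return len(S) * n * 2.0 * eps * p * d ** ((p - 1) / 2)
```

Comparing `abs(L_p(xs, S, p) - L_p(ys, S, p))` with `perturbation_bound(len(xs), S, p, eps)` for a few values of $p$ shows how loose the worst-case bound typically is.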



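Corollary 29 (Copula Perturbation). Let $X_1, X_2, \dots, X_n$ be an i.i.d. sample from a probability distribution over $\mathbb{R}^d$ with marginal c.d.f.'s $F_1, F_2, \dots, F_d$. Let $\mathbf{F}$ be the copula defined by (9) and let $\hat{\mathbf{F}}$ be the empirical copula transformation defined by (10). Let $Z_i = \mathbf{F}(X_i)$ and $\hat{Z}_i = \hat{\mathbf{F}}(X_i)$. Then for any $\delta > 0$, with probability at least $1 - \delta$,

$$\left| \frac{L_p(\hat{Z}_{1:n})}{\gamma n^{1-p/d}} - \frac{L_p(Z_{1:n})}{\gamma n^{1-p/d}} \right| \le \begin{cases} O\!\left( n^{p/d - p/2} (\log(1/\delta))^{p/2} \right), & \text{if } 0 < p < 1 ; \\ O\!\left( n^{p/d - 1/2} (\log(1/\delta))^{1/2} \right), & \text{if } 1 \le p . \end{cases}$$

Proof. It follows immediately from Corollary 26 and Lemma 28 that with probability at least $1 - \delta$,

$$|L_p(\hat{Z}_{1:n}) - L_p(Z_{1:n})| \le \begin{cases} O\!\left( n^{1 - p/2} (\log(1/\delta))^{p/2} \right), & \text{if } 0 < p < 1 ; \\ O\!\left( n^{1/2} (\log(1/\delta))^{1/2} \right), & \text{if } 1 \le p . \end{cases} \qquad □$$

We are now ready to give the proof of Theorem 3.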
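Proof of Theorem 3. Let $g$ denote the density of the copula of $\mu$. The first part follows from Corollary 29 and a standard Borel-Cantelli argument with $\delta = 1/n^2$. Corollary 29 puts the restrictions $d \ge 3$ and $1/2 < \alpha < 1$.

The second part can be proved along the same lines. From (7) we have that for any $\delta > 0$, with probability at least $1 - \delta$,

$$\left| \frac{L_p(Z_{1:n})}{\gamma n^{1-p/d}} - \int_{[0,1]^d} g^{1-p/d}(x)\,dx \right| \le \begin{cases} O\!\left( n^{-\frac{d-p}{d(2d-p)}} (\log(1/\delta))^{1/2 - p/(2d)} \right), & \text{if } 0 < p < d-1 ; \\ O\!\left( n^{-\frac{d-p}{d(d+1)}} (\log(1/\delta))^{1/2 - p/(2d)} \right), & \text{if } d-1 \le p < d . \end{cases}$$

Hence using the triangle inequality again, and exploiting that $(\log(1/\delta))^{1/2 - p/(2d)} < (\log(1/\delta))^{1/2}$ if $0 < p$, $\delta < 1$, we have that with probability at least $1 - \delta$,

$$\left| \frac{L_p(\hat{Z}_{1:n})}{\gamma n^{1-p/d}} - \int_{[0,1]^d} g^{1-p/d}(x)\,dx \right| \le \begin{cases} O\!\left( \max\{ n^{-\frac{d-p}{d(2d-p)}}, n^{-p/2 + p/d} \} \sqrt{\log(1/\delta)} \right), & \text{if } 0 < p < 1 ; \\ O\!\left( \max\{ n^{-\frac{d-p}{d(2d-p)}}, n^{-1/2 + p/d} \} \sqrt{\log(1/\delta)} \right), & \text{if } 1 \le p < d-1 ; \\ O\!\left( \max\{ n^{-\frac{d-p}{d(d+1)}}, n^{-1/2 + p/d} \} \sqrt{\log(1/\delta)} \right), & \text{if } d-1 \le p < d . \end{cases}$$

To finish the proof, exploit that when $x \to 0$ then $\log(1 \pm x) = \pm O(x)$. □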